A survey of cross-validation procedures for 
model selection 



July 27, 2009 



Sylvain Arlot,
CNRS ; Willow Project-Team,
Laboratoire d'Informatique de l'École Normale Supérieure
(CNRS/ENS/INRIA UMR 8548)
45, rue d'Ulm, 75 230 Paris, France
Sylvain.Arlot@ens.fr

Alain Celisse,
Laboratoire Paul Painlevé, UMR CNRS 8524,
Université des Sciences et Technologies de Lille 1
F-59 655 Villeneuve d'Ascq Cedex, France
Alain.Celisse@math.univ-lille1.fr

Abstract 

Used to estimate the risk of an estimator or to perform model selec- 
tion, cross-validation is a widespread strategy because of its simplicity 
and its apparent universality. Many results exist on the model selection 
performances of cross-validation procedures. This survey intends to relate 
these results to the most recent advances of model selection theory, with a 
particular emphasis on distinguishing empirical statements from rigorous 
theoretical results. As a conclusion, guidelines are provided for choosing 
the best cross-validation procedure according to the particular features of 
the problem in hand.

1 Introduction



Many statistical algorithms, such as likelihood maximization, least squares and 
empirical contrast minimization, rely on the preliminary choice of a model, that 
is of a set of parameters from which an estimate will be returned. When several 
candidate models (thus algorithms) are available, choosing one of them is called 
the model selection problem. 

Cross-validation (CV) is a popular strategy for model selection, and more 
generally algorithm selection. The main idea behind CV is to split the data (once 
or several times) for estimating the risk of each algorithm: Part of the data (the 
training sample) is used for training each algorithm, and the remaining part 
(the validation sample) is used for estimating the risk of the algorithm. Then, 
CV selects the algorithm with the smallest estimated risk. 

Compared to the resubstitution error, CV avoids overfitting because the
training sample is independent from the validation sample (at least when data
are i.i.d.). The popularity of CV mostly comes from the generality of the data
splitting heuristics, which only assumes that data are i.i.d. Nevertheless,
theoretical and empirical studies of CV procedures do not entirely confirm this
"universality". Some CV procedures have been proved to fail for some model
selection problems, depending on the goal of model selection: estimation or
identification (see Section 2). Furthermore, many theoretical questions about
CV remain widely open.

The aim of the present survey is to provide a clear picture of what is known 
about CV, from both theoretical and empirical points of view. More precisely, 
the aim is to answer the following questions: What is CV doing? When does 
CV work for model selection, keeping in mind that model selection can target 
different goals? Which CV procedure should be used for each model selection 
problem? 

The paper is organized as follows. First, the rest of Section 1 presents the
statistical framework. Although not exhaustive, the present setting has been
chosen general enough for sketching the complexity of CV for model selection.
The model selection problem is introduced in Section 2. A brief overview of
some model selection procedures that are important to keep in mind for
understanding CV is given in Section 3. The most classical CV procedures are
defined in Section 4. Since they are the keystone of the behaviour of CV for
model selection, the main properties of CV estimators of the risk for a fixed
model are detailed in Section 5. Then, the general performances of CV for
model selection are described, when the goal is either estimation (Section 6) or
identification (Section 7). Specific properties of CV in some particular
frameworks are discussed in Section 8. Finally, Section 9 focuses on the algorithmic
complexity of CV procedures, and Section 10 concludes the survey by tackling
several practical questions about CV.

1.1 Statistical framework 

Assume that some data $\xi_1, \ldots, \xi_n \in \Xi$ with common distribution $P$ are
observed. Throughout the paper, except in Section 8.3, the $\xi_i$ are assumed to
be independent. The purpose of statistical inference is to estimate from the
data $(\xi_i)_{1 \le i \le n}$ some target feature $s$ of the unknown distribution $P$, such as
the mean or the variance of $P$. Let $\mathbb{S}$ denote the set of possible values for $s$.
The quality of $t \in \mathbb{S}$, as an approximation of $s$, is measured by its loss $\mathcal{L}(t)$,
where $\mathcal{L} : \mathbb{S} \to \mathbb{R}$ is called the loss function, and is assumed to be minimal for
$t = s$. Many loss functions can be chosen for a given statistical problem.
Several classical loss functions are defined by

$$ \mathcal{L}(t) = \mathcal{L}_P(t) := \mathbb{E}_{\xi \sim P}\left[ \gamma(t; \xi) \right] , \tag{1} $$

where $\gamma : \mathbb{S} \times \Xi \to [0, \infty)$ is called a contrast function. Basically, for $t \in \mathbb{S}$
and $\xi \in \Xi$, $\gamma(t; \xi)$ measures how well $t$ is in accordance with observation of $\xi$,
so that the loss of $t$, defined by (1), measures the average accordance between
$t$ and new observations $\xi$ with distribution $P$. Therefore, several frameworks
such as transductive learning do not fit definition (1). Nevertheless, as detailed
in Section 1.2, definition (1) includes most classical statistical frameworks.
Another useful quantity is the excess loss

$$ \ell(s, t) := \mathcal{L}_P(t) - \mathcal{L}_P(s) \ge 0 , $$

which is related to the risk of an estimator $\widehat{s}$ of the target $s$ by

$$ R(\widehat{s}) = \mathbb{E}_{\xi_1, \ldots, \xi_n \sim P}\left[ \ell(s, \widehat{s}) \right] . $$
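
As a purely illustrative sketch of these definitions, one may take the target $s$ to be the mean of $P$ with the contrast $\gamma(t; \xi) = (t - \xi)^2$, so that the excess loss is $(t - s)^2$ and the risk of the sample mean equals the variance of $P$ divided by $n$. The Gaussian distribution, the sample size and all function names below are assumptions made for the example; the code approximates the risk by Monte Carlo.

```python
import random


def excess_loss(s: float, t: float) -> float:
    """Excess loss l(s, t) = L_P(t) - L_P(s) for the contrast (t - xi)^2."""
    return (t - s) ** 2


def sample_mean(sample):
    """Estimator s_hat: the empirical mean of the observations."""
    return sum(sample) / len(sample)


def monte_carlo_risk(estimator, s, sigma, n, n_rep=5000, seed=0):
    """Approximate R(s_hat) = E[l(s, s_hat)] by averaging over n_rep samples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_rep):
        # Assumed data distribution for this sketch: Gaussian with mean s.
        sample = [rng.gauss(s, sigma) for _ in range(n)]
        total += excess_loss(s, estimator(sample))
    return total / n_rep


risk = monte_carlo_risk(sample_mean, s=2.0, sigma=1.0, n=10)
print(f"estimated risk: {risk:.4f} (theory: sigma^2 / n = {1.0 / 10:.4f})")
```

The Monte Carlo average should be close to $\sigma^2 / n$, up to simulation error.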



1.2 Examples 

The purpose of this subsection is to show that the framework of Section 1.1
includes several important statistical frameworks. This list of examples is
not meant to be exhaustive.



Density estimation aims at estimating the density $s$ of $P$ with respect to
some given measure $\mu$ on $\Xi$. Then, $\mathbb{S}$ is the set of densities on $\Xi$ with respect
to $\mu$. For instance, taking $\gamma(t; x) = -\ln(t(x))$ in (1), the loss is minimal when
$t = s$ and the excess loss

$$ \ell(s, t) = \mathcal{L}_P(t) - \mathcal{L}_P(s) = \mathbb{E}_{\xi \sim P}\left[ \ln\left( \frac{s(\xi)}{t(\xi)} \right) \right] = \int s \ln\left( \frac{s}{t} \right) d\mu $$

is the Kullback-Leibler divergence between distributions $t\mu$ and $s\mu$.



Prediction aims at predicting a quantity of interest $Y \in \mathcal{Y}$ given an
explanatory variable $X \in \mathcal{X}$ and a sample of observations $(X_1, Y_1), \ldots, (X_n, Y_n)$. In
other words, $\Xi = \mathcal{X} \times \mathcal{Y}$, $\mathbb{S}$ is the set of measurable mappings $\mathcal{X} \to \mathcal{Y}$ and
the contrast $\gamma(t; (x, y))$ measures the discrepancy between the observed $y$ and
its predicted value $t(x)$. Two classical prediction frameworks are regression and
classification, which are detailed below.



Regression corresponds to continuous $\mathcal{Y}$, that is $\mathcal{Y} \subset \mathbb{R}$ (or $\mathbb{R}^k$ for
multivariate regression), the feature space $\mathcal{X}$ being typically a subset of $\mathbb{R}^{\ell}$. Let $s$ denote
the regression function, that is $s(x) = \mathbb{E}_{(X,Y) \sim P}[Y \mid X = x]$, so that

$$ \forall i, \quad Y_i = s(X_i) + \epsilon_i \quad \text{with} \quad \mathbb{E}[\epsilon_i \mid X_i] = 0 . $$

A popular contrast in regression is the least-squares contrast $\gamma(t; (x, y)) =
(t(x) - y)^2$, which is minimal over $\mathbb{S}$ for $t = s$, and the excess loss is

$$ \ell(s, t) = \mathbb{E}_{(X,Y) \sim P}\left[ (s(X) - t(X))^2 \right] . $$

Note that the excess loss of $t$ is the square of the $L^2$ distance between $t$ and $s$,
so that prediction and estimation are equivalent goals.
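
The expression of the excess loss for the least-squares contrast follows from a short computation, sketched here under the additional assumption that $Y$ and $t(X)$ are square-integrable:

```latex
\begin{align*}
\mathcal{L}_P(t)
  &= \mathbb{E}\big[ (t(X) - s(X) + s(X) - Y)^2 \big] \\
  &= \mathbb{E}\big[ (t(X) - s(X))^2 \big]
   + 2\, \mathbb{E}\big[ (t(X) - s(X)) \, (s(X) - Y) \big]
   + \mathbb{E}\big[ (s(X) - Y)^2 \big] \\
  &= \mathbb{E}\big[ (t(X) - s(X))^2 \big] + \mathcal{L}_P(s) .
\end{align*}
```

The cross term vanishes by conditioning on $X$, since $\mathbb{E}[Y - s(X) \mid X] = 0$ by definition of the regression function. Subtracting $\mathcal{L}_P(s)$ gives the excess loss and shows that $s$ minimizes the loss.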



Classification corresponds to finite $\mathcal{Y}$ (at least discrete). In particular, when
$\mathcal{Y} = \{0, 1\}$, the prediction problem is called binary (supervised) classification.
With the 0-1 contrast function $\gamma(t; (x, y)) = \mathbf{1}_{t(x) \neq y}$, the minimizer of the loss
is the so-called Bayes classifier $s$ defined by

$$ s(x) = \mathbf{1}_{\eta(x) \ge 1/2} , $$

where $\eta$ denotes the regression function $\eta(x) = \mathbb{P}_{(X,Y) \sim P}(Y = 1 \mid X = x)$.

Note that a slightly different framework is often considered in binary
classification. Instead of looking only for a classifier, the goal is to estimate also the
confidence in the classification made at each point: $\mathbb{S}$ is the set of measurable
mappings $\mathcal{X} \to \mathbb{R}$, the classifier $x \mapsto \mathbf{1}_{t(x) > 0}$ being associated to any $t \in \mathbb{S}$.
Basically, the larger $|t(x)|$, the more confident we are in the classification made
from $t(x)$. A classical family of losses associated with this problem is defined by
(1) with the contrast $\gamma_\phi(t; (x, y)) = \phi(-(2y - 1) t(x))$ where $\phi : \mathbb{R} \to [0, \infty)$
is some function. The 0-1 contrast corresponds to $\phi(u) = \mathbf{1}_{u \ge 0}$. The convex
loss functions correspond to the case where $\phi$ is convex, nondecreasing with
$\lim_{-\infty} \phi = 0$ and $\phi(0) = 1$. Classical examples are $\phi(u) = \max\{1 + u, 0\}$
(hinge), $\phi(u) = \exp(u)$ (exponential), and $\phi(u) = \log_2(1 + \exp(u))$ (logit). The
corresponding losses are used as objective functions by several classical learning algorithms
such as support vector machines (hinge) and boosting (exponential and logit).
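
The functions $\phi$ listed above can be written down directly. The following sketch (function names are illustrative choices, not taken from any library) evaluates the contrast $\gamma_\phi(t; (x, y)) = \phi(-(2y - 1) t(x))$ for a score $t(x)$ and a label $y \in \{0, 1\}$, and makes visible that the three convex examples satisfy $\phi(0) = 1$.

```python
import math


def zero_one(u: float) -> float:
    """phi(u) = 1 if u >= 0 else 0 (the 0-1 contrast)."""
    return 1.0 if u >= 0 else 0.0


def hinge(u: float) -> float:
    """phi(u) = max(1 + u, 0)."""
    return max(1.0 + u, 0.0)


def exponential(u: float) -> float:
    """phi(u) = exp(u)."""
    return math.exp(u)


def logit(u: float) -> float:
    """phi(u) = log_2(1 + exp(u))."""
    return math.log2(1.0 + math.exp(u))


def contrast(phi, score: float, y: int) -> float:
    """gamma_phi(t; (x, y)) = phi(-(2y - 1) t(x)), where score = t(x)."""
    return phi(-(2 * y - 1) * score)


# A confident score t(x) = 2 evaluated against the label y = 1, then y = 0.
for name, phi in [("0-1", zero_one), ("hinge", hinge),
                  ("exponential", exponential), ("logit", logit)]:
    print(name, contrast(phi, 2.0, 1), contrast(phi, 2.0, 0))
```

In each case the contrast is small when the sign of $t(x)$ agrees with the label and grows with $|t(x)|$ when it does not, which is the behaviour described in the text.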

Many references on classification theory, including model selection, can be
found in the survey by Boucheron et al. (2005).



1.3 Statistical algorithms 

In this survey, a statistical algorithm $A$ is any (measurable) mapping $A :
\bigcup_{n \in \mathbb{N}} \Xi^n \to \mathbb{S}$. The idea is that data $D_n = (\xi_i)_{1 \le i \le n} \in \Xi^n$ will be used as
an input of $A$, and that the output of $A$, $A(D_n) = \widehat{s}^A(D_n) \in \mathbb{S}$, is an estimator
of $s$. The quality of $A$ is then measured by $\mathcal{L}_P(\widehat{s}^A(D_n))$, which should be as
small as possible. In the sequel, the algorithm $A$ and the estimator $\widehat{s}^A(D_n)$ are
often identified when no confusion is possible.

Minimum contrast estimators form a classical family of statistical algorithms,
defined as follows. Given some subset $S$ of $\mathbb{S}$ that we call a model, a minimum
contrast estimator of $s$ is any minimizer of the empirical contrast

$$ t \mapsto \mathcal{L}_{P_n}(t) = \frac{1}{n} \sum_{i=1}^{n} \gamma(t; \xi_i), \quad \text{where} \quad P_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{\xi_i} , $$

over $S$. The idea is that the empirical contrast $\mathcal{L}_{P_n}(t)$ has an expectation
$\mathcal{L}_P(t)$ which is minimal over $\mathbb{S}$ at $s$. Hence, minimizing $\mathcal{L}_{P_n}(t)$ over a set $S$ of
candidate values for $s$ hopefully leads to a good estimator of $s$. Let us now give
three popular examples of empirical contrast minimizers:

• Maximum likelihood estimators: take $\gamma(t; x) = -\ln(t(x))$ in the density
estimation setting. A classical choice for $S$ is the set of piecewise constant
functions on a regular partition of $\Xi$ with $K$ pieces.

• Least-squares estimators: take $\gamma(t; (x, y)) = (t(x) - y)^2$, the least-squares
contrast, in the regression setting. For instance, $S$ can be the set of
piecewise constant functions on some fixed partition of $\mathcal{X}$ (leading to
regressograms), or a vector space spanned by the first vectors of a wavelet or Fourier
basis, among many others. Note that regularized least-squares algorithms
such as the Lasso, ridge regression and spline smoothing are also
least-squares estimators, the model $S$ being some ball of a (data-dependent)
radius for the $L^1$ (resp. $L^2$) norm in some high-dimensional space. Hence,
tuning the regularization parameter for the Lasso or SVM, for instance,
amounts to performing model selection from a collection of models.

• Empirical risk minimizers, following the terminology of Vapnik (1982):
take any contrast function $\gamma$ in the prediction setting. When $\gamma$ is the 0-1
contrast, popular choices for $S$ lead to linear classifiers, partitioning rules,
and neural networks. Boosting and Support Vector Machines classifiers
are also empirical contrast minimizers over some data-dependent model
$S$, with contrast $\gamma = \gamma_\phi$ for some convex functions $\phi$.
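
To make the notion of minimum contrast estimator concrete, here is a minimal sketch of a regressogram, that is, the least-squares estimator over piecewise constant functions on a regular partition. The choice $\mathcal{X} = [0, 1]$, the simulated sine signal, the noise level and all function names are assumptions made for the example.

```python
import math
import random


def empirical_contrast(t, data):
    """L_{P_n}(t): average least-squares contrast of t over the sample."""
    return sum((t(x) - y) ** 2 for x, y in data) / len(data)


def fit_regressogram(data, n_bins):
    """Minimize the empirical least-squares contrast over piecewise constant
    functions on the regular partition of [0, 1] into n_bins cells.
    The minimizer takes the mean of the y-values falling in each cell."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in data:
        k = min(int(x * n_bins), n_bins - 1)
        sums[k] += y
        counts[k] += 1
    # Cells without data get the value 0.0 (arbitrary convention of this sketch).
    levels = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return lambda x: levels[min(int(x * n_bins), n_bins - 1)]


rng = random.Random(0)
xs = [rng.random() for _ in range(200)]
data = [(x, math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.3)) for x in xs]

s_hat = fit_regressogram(data, n_bins=8)


def shifted(x):
    """Another element of the same model, used only for comparison."""
    return s_hat(x) + 0.1


print(empirical_contrast(s_hat, data), empirical_contrast(shifted, data))
```

Any other function of the same model, such as the shifted one above, has a larger empirical contrast than the fitted regressogram, which is what "minimum contrast estimator over $S$" means.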



Let us finally mention that many other classical statistical algorithms can
be considered with CV, for instance local average estimators in the prediction
framework such as k-Nearest Neighbours and Nadaraya-Watson kernel
estimators. The focus will be mainly kept on minimum contrast estimators to keep
the length of the survey reasonable.



2 Model selection 

Usually, several statistical algorithms can be used for solving a given statistical
problem. Let $(\widehat{s}_\lambda)_{\lambda \in \Lambda}$ denote such a family of candidate statistical algorithms.
The algorithm selection problem aims at choosing from data one of these
algorithms, that is, choosing some $\widehat{\lambda}(D_n) \in \Lambda$. Then, the final estimator of $s$ is given
by $\widehat{s}_{\widehat{\lambda}(D_n)}(D_n)$. The main difficulty is that the same data are used for training
the algorithms, that is, for computing $(\widehat{s}_\lambda(D_n))_{\lambda \in \Lambda}$, and for choosing $\widehat{\lambda}(D_n)$.



2.1 The model selection paradigm 

Following Section 1.3, let us focus on the model selection problem, where
candidate algorithms are minimum contrast estimators and the goal is to choose a
model $S$. Let $(S_m)_{m \in \mathcal{M}_n}$ be a family of models, that is, $S_m \subset \mathbb{S}$. Let $\gamma$ be a
fixed contrast function, and for every $m \in \mathcal{M}_n$, let $\widehat{s}_m$ be a minimum contrast
estimator over model $S_m$ with contrast $\gamma$. The goal is to choose $\widehat{m}(D_n) \in \mathcal{M}_n$
from data only.

The choice of a model $S_m$ has to be done carefully. Indeed, when $S_m$ is a
"small" model, $\widehat{s}_m$ is a poor statistical algorithm except when $s$ is very close to
$S_m$, since

$$ \ell(s, \widehat{s}_m) \ge \inf_{t \in S_m} \{ \ell(s, t) \} := \ell(s, S_m) . $$

The lower bound $\ell(s, S_m)$ is called the bias of model $S_m$, or approximation
error. The bias is a nonincreasing function of $S_m$.

On the contrary, when $S_m$ is "huge", its bias $\ell(s, S_m)$ is small for most
targets $s$, but $\widehat{s}_m$ clearly overfits. Think for instance of $S_m$ as the set of all
continuous functions on $[0, 1]$ in the regression framework. More generally, if
$S_m$ is a vector space of dimension $D_m$, in several classical frameworks,

$$ \mathbb{E}\left[ \ell(s, \widehat{s}_m(D_n)) \right] \approx \ell(s, S_m) + \lambda D_m \tag{2} $$

where $\lambda > 0$ does not depend on $m$. For instance, $\lambda = 1/(2n)$ in density
estimation using the likelihood contrast, and $\lambda = \sigma^2/n$ in regression using the
least-squares contrast and assuming $\mathrm{var}(Y \mid X) = \sigma^2$ does not depend on $X$.
The meaning of (2) is that a good model choice should balance the bias term
$\ell(s, S_m)$ and the variance term $\lambda D_m$, that is, solve the so-called bias-variance
trade-off. By extension, the variance term, also called estimation error, can be
defined by

$$ \mathbb{E}\left[ \ell(s, \widehat{s}_m(D_n)) \right] - \ell(s, S_m) = \mathbb{E}\left[ \mathcal{L}_P(\widehat{s}_m) \right] - \inf_{t \in S_m} \mathcal{L}_P(t) , $$

even when (2) does not hold.
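
The trade-off expressed by (2) can be illustrated numerically. The sketch below assumes a hypothetical target whose coefficients in an orthonormal basis are $\theta_j = 1/j$, nested models $S_m$ spanned by the first $m$ basis vectors (so $D_m = m$ and the bias is $\sum_{j > m} \theta_j^2$), and $\lambda = \sigma^2/n$ as in least-squares regression. These choices are assumptions made for the example only.

```python
def bias(m, theta):
    """l(s, S_m): squared norm of the coefficients left out of the model."""
    return sum(c * c for c in theta[m:])


def approx_risk(m, theta, lam):
    """Right-hand side of (2) with D_m = m: bias plus variance term."""
    return bias(m, theta) + lam * m


def best_dimension(theta, sigma2, n):
    """Dimension balancing bias and variance for lambda = sigma^2 / n."""
    lam = sigma2 / n
    return min(range(len(theta) + 1), key=lambda m: approx_risk(m, theta, lam))


# Hypothetical target: coefficients theta_j = 1 / j for j = 1, ..., 200.
theta = [1.0 / j for j in range(1, 201)]
for n in (50, 500, 5000):
    print(n, best_dimension(theta, sigma2=1.0, n=n))
```

Adding the $m$-th basis vector is worthwhile as long as $\theta_m^2 > \lambda$, so the balancing dimension grows with the sample size: more data allow larger models.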

The interested reader can find a much deeper insight into model selection in
the Saint-Flour lecture notes by Massart (2007).



Before giving examples of classical model selection procedures, let us mention 
the two main different goals that model selection can target: estimation and 
identification. 



2.2 Model selection for estimation 

On the one hand, the goal of model selection is estimation when $\widehat{s}_{\widehat{m}(D_n)}(D_n)$
is used as an approximation of the target $s$, and the goal is to minimize its
loss. For instance, AIC and Mallows' $C_p$ model selection procedures are built
for estimation (see Section 3.1).

The quality of a model selection procedure $D_n \mapsto \widehat{m}(D_n)$, designed for
estimation, is measured by the excess loss of $\widehat{s}_{\widehat{m}(D_n)}(D_n)$. Hence, the best possible
model choice for estimation is the so-called oracle model $S_{m^\star}$, defined by

$$ m^\star = m^\star(D_n) \in \arg\min_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m(D_n)) \} . \tag{3} $$

Since $m^\star(D_n)$ depends on the unknown distribution $P$ of data, one cannot
expect to select $\widehat{m}(D_n) = m^\star(D_n)$ almost surely. Nevertheless, we can hope to
select $\widehat{m}(D_n)$ such that $\widehat{s}_{\widehat{m}(D_n)}$ is almost as close to $s$ as $\widehat{s}_{m^\star(D_n)}$. Note that
there is no requirement for $s$ to belong to $\bigcup_{m \in \mathcal{M}_n} S_m$.

Depending on the framework, the optimality of a model selection procedure 
for estimation is assessed in at least two different ways. 

First, in the asymptotic framework, a model selection procedure $\widehat{m}$ is called
efficient (or asymptotically optimal) when it leads to $\widehat{m}$ such that

$$ \frac{\ell(s, \widehat{s}_{\widehat{m}(D_n)}(D_n))}{\inf_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m(D_n)) \}} \xrightarrow[n \to \infty]{\text{a.s.}} 1 . $$

Sometimes, a weaker result is proved, the convergence holding only in
probability.

Second, in the non-asymptotic framework, a model selection procedure
satisfies an oracle inequality with constant $C_n \ge 1$ and remainder term $R_n \ge 0$
when

$$ \ell(s, \widehat{s}_{\widehat{m}(D_n)}(D_n)) \le C_n \inf_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m(D_n)) \} + R_n \tag{4} $$

holds either in expectation or with large probability (that is, a probability larger
than $1 - C'/n^2$, for some positive constant $C'$). Note that if (4) holds on
a large probability event with $C_n$ tending to 1 when $n$ tends to infinity and
$R_n \ll \ell(s, \widehat{s}_{m^\star}(D_n))$, then the model selection procedure $\widehat{m}$ is efficient.

In the estimation setting, model selection is often used for building adaptive
estimators, assuming that $s$ belongs to some function space $\mathcal{T}_\alpha$ (Barron et al.,
1999). Then, a model selection procedure $\widehat{m}$ is optimal when it leads to an
estimator $\widehat{s}_{\widehat{m}(D_n)}(D_n)$ (approximately) minimax with respect to $\mathcal{T}_\alpha$ without knowing
$\alpha$, provided the family $(S_m)_{m \in \mathcal{M}_n}$ has been well-chosen.

2.3 Model selection for identification 

On the other hand, model selection can aim at identifying the "true model"
$S_{m_0}$, defined as the "smallest" model among $(S_m)_{m \in \mathcal{M}_n}$ to which $s$ belongs.
In particular, $s \in \bigcup_{m \in \mathcal{M}_n} S_m$ is assumed in this setting. A typical example of
model selection procedure built for identification is BIC (see Section 3.3).

The quality of a model selection procedure designed for identification is
measured by its probability of recovering the true model $m_0$. Then, a model
selection procedure is called (model) consistent when

$$ \mathbb{P}\left( \widehat{m}(D_n) = m_0 \right) \xrightarrow[n \to \infty]{} 1 . $$

Note that identification can naturally be extended to the general algorithm
selection problem, the "true model" being replaced by the statistical algorithm
whose risk converges at the fastest rate (see for instance Yang, 2007).



2.4 Estimation vs. identification 

When a true model exists, model consistency is clearly a stronger property than
efficiency defined in Section 2.2. However, in many frameworks, no true model
exists, so that efficiency is the only well-defined property.

Could a model selection procedure be model consistent in the former case
(like BIC) and efficient in the latter case (like AIC)? The general answer to this
question, often called the AIC-BIC dilemma, is negative: Yang (2005) proved in
the regression framework that no model selection procedure can be
simultaneously model consistent and minimax rate optimal. Nevertheless, the strengths
of AIC and BIC can sometimes be shared; see for instance the introduction of
a paper by Yang (2005) and a recent paper by van Erven et al. (2008).



3 Overview of some model selection procedures 

Several approaches can be used for model selection. Let us briefly sketch here
some of them, which are particularly helpful for understanding how CV works.

Like CV, all the procedures considered in this section select

$$ \widehat{m}(D_n) \in \arg\min_{m \in \mathcal{M}_n} \{ \mathrm{crit}(m; D_n) \} \tag{5} $$

where $\forall m \in \mathcal{M}_n$, $\mathrm{crit}(m; D_n) = \mathrm{crit}(m) \in \mathbb{R}$ is some data-dependent criterion.

A particular case of (5) is penalization, which consists in choosing the model
minimizing the sum of empirical contrast and some measure of complexity of
the model (called penalty) which can depend on the data, that is,

$$ \widehat{m}(D_n) \in \arg\min_{m \in \mathcal{M}_n} \{ \mathcal{L}_{P_n}(\widehat{s}_m) + \mathrm{pen}(m; D_n) \} . \tag{6} $$
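
As an illustration of (6), the sketch below selects the number of cells of a regressogram by penalization. It uses the classical Mallows-type penalty $2 \sigma^2 D_m / n$ for least-squares regression and assumes the noise level $\sigma$ is known; the simulated data, the family of models and all function names are hypothetical choices made for the example.

```python
import math
import random


def regressogram_levels(data, n_bins):
    """Least-squares fit on a regular partition of [0, 1]: one mean per cell."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for x, y in data:
        k = min(int(x * n_bins), n_bins - 1)
        sums[k] += y
        counts[k] += 1
    # Empty cells get 0.0 (arbitrary convention of this sketch).
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def empirical_contrast(levels, data):
    """L_{P_n}(s_hat_m) for the piecewise constant function given by levels."""
    n_bins = len(levels)
    return sum((levels[min(int(x * n_bins), n_bins - 1)] - y) ** 2
               for x, y in data) / len(data)


def select_by_penalization(data, dims, pen):
    """m_hat in argmin over dims of { L_{P_n}(s_hat_m) + pen(m) }, as in (6)."""
    def crit(m):
        return empirical_contrast(regressogram_levels(data, m), data) + pen(m)
    return min(dims, key=crit)


rng = random.Random(1)
sigma, n = 0.5, 400
xs = [rng.random() for _ in range(n)]
data = [(x, math.sin(2 * math.pi * x) + rng.gauss(0.0, sigma)) for x in xs]
dims = range(1, 101)  # model m = regressogram with m cells, so D_m = m

m_cp = select_by_penalization(data, dims, lambda m: 2 * sigma ** 2 * m / n)
m_no_pen = select_by_penalization(data, dims, lambda m: 0.0)
print("penalized choice:", m_cp, "| unpenalized choice:", m_no_pen)
```

Without a penalty, the criterion reduces to the empirical contrast, which tends to favour the largest models; the penalty term is what prevents this overfitting.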



This section is not meant to be exhaustive. Completely different approaches
exist for model selection, such as the Minimum Description Length (MDL)
(Rissanen, 1983), and the Bayesian approaches. The interested reader will
find more details and references on model selection procedures in the books
by Burnham and Anderson (2002) or Massart (2007) for instance.

Let us focus here on five main categories of model selection procedures, the
first three coming from a classification made by Shao (1997) in the linear
regression framework.



3.1 The unbiased risk estimation principle 

When the goal of model selection is estimation, many model selection
procedures are of the form (5) where $\mathrm{crit}(m; D_n)$ unbiasedly estimates (at least,
asymptotically) the loss $\mathcal{L}_P(\widehat{s}_m)$. This general idea is often called the unbiased
risk estimation principle, or Mallows' or Akaike's heuristics.

In order to explain why this strategy can perform well, let us write the
starting point of most theoretical analyses of procedures defined by (5): By
definition (5), for every $m \in \mathcal{M}_n$,

$$ \ell(s, \widehat{s}_{\widehat{m}}) + \mathrm{crit}(\widehat{m}) - \mathcal{L}_P(\widehat{s}_{\widehat{m}}) \le \ell(s, \widehat{s}_m) + \mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m) . \tag{7} $$

If $\mathbb{E}[\mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m)] = 0$ for every $m \in \mathcal{M}_n$, then concentration inequalities
are likely to prove that $\epsilon_n^-, \epsilon_n^+ > 0$ exist such that

$$ \forall m \in \mathcal{M}_n, \quad \epsilon_n^+ \ge \frac{\mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m)}{\ell(s, \widehat{s}_m)} \ge -\epsilon_n^- > -1 $$

with high probability, at least when $\mathrm{Card}(\mathcal{M}_n) \le C n^\alpha$ for some $C, \alpha \ge 0$. Then,
(7) directly implies an oracle inequality like (4) with $C_n = (1 + \epsilon_n^+)/(1 - \epsilon_n^-)$. If
$\epsilon_n^+, \epsilon_n^- \to 0$ when $n \to \infty$, this proves the procedure defined by (5) is efficient.
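
For completeness, here is the short chain of inequalities behind this claim, written with $\epsilon^{\pm}$ for the two concentration constants. It is a sketch on the event where the two-sided bound above holds for every $m \in \mathcal{M}_n$, using only (7):

```latex
\begin{align*}
(1 - \epsilon^-) \, \ell(s, \widehat{s}_{\widehat{m}})
  &\le \ell(s, \widehat{s}_{\widehat{m}}) + \mathrm{crit}(\widehat{m})
       - \mathcal{L}_P(\widehat{s}_{\widehat{m}})
  && \text{(lower bound applied to } \widehat{m}\text{)} \\
  &\le \ell(s, \widehat{s}_m) + \mathrm{crit}(m) - \mathcal{L}_P(\widehat{s}_m)
  && \text{(by (7))} \\
  &\le (1 + \epsilon^+) \, \ell(s, \widehat{s}_m)
  && \text{(upper bound applied to } m\text{)} .
\end{align*}
```

Since $1 - \epsilon^- > 0$, dividing by it and taking the infimum over $m \in \mathcal{M}_n$ yields (4) with $C_n = (1 + \epsilon^+)/(1 - \epsilon^-)$ and $R_n = 0$.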

Examples of model selection procedures following the unbiased risk
estimation principle are FPE (Final Prediction Error, Akaike, 1970), several
cross-validation procedures including the Leave-one-out (see Section 4), and GCV
(Generalized Cross-Validation, Craven and Wahba, 1979; see Section 4.3.3).



With the penalization approach (6), the unbiased risk estimation principle is
that $\mathbb{E}[\mathrm{pen}(m)]$ should be close to the "ideal penalty"

$$ \mathrm{pen}_{\mathrm{id}}(m) := \mathcal{L}_P(\widehat{s}_m) - \mathcal{L}_{P_n}(\widehat{s}_m) . $$

Several classical penalization procedures follow this principle, for instance:

• With the log-likelihood contrast, AIC (Akaike Information Criterion,
Akaike, 1973) and its corrected versions (Sugiura, 1978; Hurvich and Tsai,
1989).

• With the least-squares contrast, Mallows' $C_p$ (Mallows, 1973) and several
refined versions of $C_p$ (see for instance Baraud, 2002).

• With a general contrast, covariance penalties (Efron, 2004).

AIC, Mallows' $C_p$ and related procedures have been proved to be optimal
for estimation in several frameworks, provided $\mathrm{Card}(\mathcal{M}_n) \le C n^\alpha$ for some
constants $C, \alpha \ge 0$ (see the paper by Birgé and Massart, 2007, and references
therein).

The main drawback of penalties such as AIC or Mallows' $C_p$ is their
dependence on some assumptions on the distribution of data. For instance, Mallows'
$C_p$ assumes the variance of $Y$ does not depend on $X$. Otherwise, it has a
suboptimal performance (Arlot, 2008b).

Several resampling-based penalties have been proposed to overcome this
problem, at the price of a larger computational complexity, and possibly slightly
worse performance in simpler frameworks; see a paper by Efron (1983) for
bootstrap, and a paper by Arlot (2008a) and references therein for generalization to
exchangeable weights.

Finally, note that all these penalties depend on multiplying factors
which are not always known (for instance, the noise-level, for Mallows' $C_p$).
Birgé and Massart (2007) proposed a general data-driven procedure for
estimating such multiplying factors, which satisfies an oracle inequality with $C_n \to 1$
in regression (see also Arlot and Massart, 2009).



3.2 Biased estimation of the risk 

Several model selection procedures are of the form (5) where $\mathrm{crit}(m)$ does not
unbiasedly estimate the loss $\mathcal{L}_P(\widehat{s}_m)$: The weight of the variance term
compared to the bias in $\mathbb{E}[\mathrm{crit}(m)]$ is slightly larger than in the decomposition (2)
of $\mathcal{L}_P(\widehat{s}_m)$. From the penalization point of view, such procedures are
overpenalizing.

Examples of such procedures are FPE$_\alpha$ (Bhansali and Downham, 1977) and
GIC$_\lambda$ (Generalized Information Criterion, Nishii, 1984; Shao, 1997) with $\alpha, \lambda >
2$, which are closely related. Some cross-validation procedures, such as
Leave-p-out with $p/n \in (0, 1)$ fixed, also belong to this category (see Section 4.3.1).
Note that FPE$_\alpha$ with $\alpha = 2$ is FPE, and GIC$_\lambda$ with $\lambda = 2$ is close to FPE and
Mallows' $C_p$.

When the goal is estimation, there are two main reasons for using "biased"
model selection procedures. First, experimental evidence shows that
overpenalizing often yields better performance when the signal-to-noise ratio is small (see
for instance Arlot, 2007, Chapter 11).

Second, when the number of models $\mathrm{Card}(\mathcal{M}_n)$ grows faster than any power
of $n$, as in the complete variable selection problem with $n$ variables, then the
unbiased risk estimation principle fails. From the penalization point of view,
Birgé and Massart (2007) proved that when $\mathrm{Card}(\mathcal{M}_n) = e^{\kappa n}$ for some $\kappa > 0$,
the minimal amount of penalty required so that an oracle inequality holds with
$C_n = O(1)$ is much larger than $\mathrm{pen}_{\mathrm{id}}(m)$. In addition to the FPE$_\alpha$ and GIC$_\lambda$
with suitably chosen $\alpha, \lambda$, several penalization procedures have been proposed
for taking into account the size of $\mathcal{M}_n$ (Barron et al., 1999; Baraud, 2002;
Birgé and Massart, 2001; Sauvé, 2009). In the same papers, these procedures
are proved to satisfy oracle inequalities with $C_n$ as small as possible, typically
of order $\ln(n)$ when $\mathrm{Card}(\mathcal{M}_n) = e^{\kappa n}$.



3.3 Procedures built for identification 

Some specific model selection procedures are used for identification. A typical
example is BIC (Bayesian Information Criterion, Schwarz, 1978).

More generally, Shao (1997) showed that several procedures identify
consistently the correct model in the linear regression framework as soon as they
overpenalize within a factor tending to infinity with $n$, for instance, GIC$_{\lambda_n}$ with
$\lambda_n \to +\infty$, FPE$_{\alpha_n}$ with $\alpha_n \to +\infty$ (Shibata, 1984), and several CV procedures
such as Leave-p-out with $p = p_n \sim n$. BIC is also part of this picture, since it
coincides with GIC$_{\ln(n)}$.

In another paper, Shao (1996) showed that $m_n$-out-of-$n$ bootstrap
penalization is also model consistent as soon as $m_n \sim n$. Compared to Efron's bootstrap
penalties, the idea is to estimate $\mathrm{pen}_{\mathrm{id}}$ with the $m_n$-out-of-$n$ bootstrap instead
of the usual bootstrap, which results in overpenalization within a factor tending
to infinity with $n$ (Arlot, 2008a).

Most MDL-based procedures can also be put into this category of model
selection procedures (see Grünwald, 2007). Let us finally mention the Lasso
(Tibshirani, 1996) and other $\ell^1$ penalization procedures, which have recently
attracted much attention (see for instance Hesterberg et al., 2008). They are
a computationally efficient way of identifying the true model in the context of
variable selection with many variables.



3.4 Structural risk minimization 



In the context of statistical learning, Vapnik and Chervonenkis (1974)
proposed the structural risk minimization approach (see also Vapnik, 1982, 1998).
Roughly, the idea is to penalize the empirical contrast with a penalty
(over)-estimating

$$ \mathrm{pen}_{\mathrm{id},g}(m) := \sup_{t \in S_m} \{ \mathcal{L}_P(t) - \mathcal{L}_{P_n}(t) \} \ge \mathrm{pen}_{\mathrm{id}}(m) . $$

Such penalties have been built using the Vapnik-Chervonenkis dimension, the
combinatorial entropy, (global) Rademacher complexities (Koltchinskii, 2001;
Bartlett et al., 2002), (global) bootstrap penalties (Fromont, 2007), Gaussian
complexities or the maximal discrepancy (Bartlett and Mendelson, 2002).
These penalties are often called global because $\mathrm{pen}_{\mathrm{id},g}(m)$ is a supremum over
$S_m$.

The localization approach (see Boucheron et al., 2005) has been introduced
in order to obtain penalties closer to $\mathrm{pen}_{\mathrm{id}}$ (such as local Rademacher
complexities), hence smaller prediction errors when possible (Bartlett et al., 2005;
Koltchinskii, 2006). Nevertheless, these penalties are still larger than $\mathrm{pen}_{\mathrm{id}}(m)$
and can be difficult to compute in practice because of several unknown
constants.

A non-asymptotic analysis of several global and local penalties can be found
in the book by Massart (2007) for instance; see also Koltchinskii (2006) for
recent results on local penalties.



3.5 Ad hoc penalization 

Let us finally mention that penalties can also be built according to particular
features of the problem. For instance, penalties can be proportional to the $\ell^p$
norm of $\widehat{s}_m$ (similarly to $\ell^p$-regularized learning algorithms) when having an
estimator with a controlled $\ell^p$ norm seems better. The penalty can also be
proportional to the squared norm of $\widehat{s}_m$ in some reproducing kernel Hilbert
space (similarly to kernel ridge regression or spline smoothing), with a kernel
adapted to the specific framework. More generally, any penalty can be used, as
soon as $\mathrm{pen}(m)$ is larger than the estimation error (to avoid overfitting) and the
best model for the final user is not the oracle $m^\star$, but more like

$$ \arg\min_{m \in \mathcal{M}_n} \{ \ell(s, \widehat{s}_m) + \kappa \, \mathrm{pen}(m) \} $$

for some $\kappa > 0$.



3.6 Where are cross-validation procedures in this picture? 

The family of CV procedures, which will be described and deeply investigated
in the next sections, contains procedures in the first three categories. CV
procedures are all of the form (5), where $\mathrm{crit}(m)$ either estimates (almost) unbiasedly
the loss $\mathcal{L}_P(\widehat{s}_m)$, or overestimates the variance term (see Section 2.1). In the
latter case, CV procedures either belong to the second or the third category,
depending on the overestimation level.

This fact has two major implications. First, CV itself does not take into
account prior information for selecting a model. To do so, one can either add
to the CV estimate of the risk a penalty term (such as $\|\widehat{s}_m\|_p$), or use prior
information to pre-select a subset of models $\widetilde{\mathcal{M}}(D_n) \subset \mathcal{M}_n$ before letting CV
select a model among $(S_m)_{m \in \widetilde{\mathcal{M}}(D_n)}$.

Second, in statistical learning, CV and resampling-based procedures are the 
most widely used model selection procedures. Structural risk minimization is 
often too pessimistic, and other alternatives rely on unrealistic assumptions. 
But if CV and resampling-based procedures are the most likely to yield good 
prediction performances, their theoretical grounds are not that firm, and too 
few CV users are careful enough when choosing a CV procedure to perform 
model selection. Among the aims of this survey is to point out both positive 
and negative results about the model selection performance of CV. 



4 Cross-validation procedures 

The purpose of this section is to describe the rationale behind CV and to define
the different CV procedures. Since all CV procedures are of the form (5),
defining a CV procedure amounts to defining the corresponding CV estimator of
the risk of an algorithm $A$, which will be $\mathrm{crit}(\cdot)$ in (5).

4.1 Cross-validation philosophy

As noticed in the early 1930s by Larson (1931), training an algorithm and
evaluating its statistical performance on the same data yields an overoptimistic result.
CV was introduced to fix this issue (Mosteller and Tukey, 1968; Stone, 1974; Geisser,
1975), starting from the remark that testing the output of the algorithm on new
data would yield a good estimate of its performance (Breiman, 1998).

In most real applications, only a limited amount of data is available, which
led to the idea of splitting the data: Part of the data (the training sample) is
used for training the algorithm, and the remaining data (the validation sample)
is used for evaluating its performance. The validation sample can play the role
of new data as soon as data are i.i.d.

Data splitting yields the validation estimate of the risk, and averaging over
several splits yields a cross-validation estimate of the risk. As will be shown in
Sections 4.2 and 4.3, various splitting strategies lead to various CV estimates of
the risk.

The major interest of CV lies in the universality of the data splitting
heuristics, which only assumes that data are identically distributed and the
training and validation samples are independent, two assumptions which can even
be relaxed (see Section 8.3). Therefore, CV can be applied to (almost) any
algorithm in (almost) any framework, for instance regression (Stone, 1974;
Geisser, 1975), density estimation (Rudemo, 1982; Stone, 1984) and
classification (Devroye and Wagner, 1979; Bartlett et al., 2002), among many others.
On the contrary, most other model selection procedures (see Section 3) are
specific to a framework: For instance, $C_p$ (Mallows, 1973) is specific to least-squares
regression.

4.2 From validation to cross-validation 

In this section, the hold-out (or validation) estimator of the risk is defined, 
leading to a general definition of CV. 



4.2.1 Hold-out 



The hold-out (Devroye and Wagner, 1979) or (simple) validation relies on a
single split of data. Formally, let $I^{(t)}$ be a non-empty proper subset of
$\{1, \ldots, n\}$, that is, such that both $I^{(t)}$ and its complement
$I^{(v)} = (I^{(t)})^c = \{1, \ldots, n\} \setminus I^{(t)}$ are non-empty. The
hold-out estimator of the risk of $\mathcal{A}(D_n)$ with training set $I^{(t)}$
is defined by

$$
\widehat{\mathcal{L}}^{\mathrm{H\text{-}O}}\left(\mathcal{A}; D_n; I^{(t)}\right)
  := \frac{1}{n_v} \sum_{i \in D_n^{(v)}} \gamma\left(\mathcal{A}(D_n^{(t)}); \xi_i\right)
\tag{8}
$$

where $D_n^{(t)} := (\xi_i)_{i \in I^{(t)}}$ is the training sample, of size
$n_t = \mathrm{Card}(I^{(t)})$, and $D_n^{(v)} := (\xi_i)_{i \in I^{(v)}}$ is the
validation sample, of size $n_v = n - n_t$; $I^{(v)}$ is called the validation
set. The question of choosing $n_t$, and $I^{(t)}$ given its cardinality $n_t$, is
discussed in the rest of this survey.
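
The hold-out estimator (8) can be sketched in a few lines of Python. The toy algorithm (the empirical mean), the squared-loss contrast and all function names below are illustrative assumptions, not notation from the survey.

```python
from statistics import mean


def squared_loss(prediction, xi):
    """Contrast gamma(t; xi) for a constant predictor t and an observation xi."""
    return (prediction - xi) ** 2


def fit_mean(sample):
    """Toy algorithm A (an assumption for illustration): the empirical mean."""
    return mean(sample)


def hold_out(algorithm, contrast, data, train_idx):
    """Hold-out estimate of the risk for the training index set train_idx."""
    train_idx = set(train_idx)
    valid_idx = [i for i in range(len(data)) if i not in train_idx]
    if not train_idx or not valid_idx:
        raise ValueError("train_idx must be a non-empty proper subset of the indices")
    # Train on the training sample only, then average the contrast
    # over the validation sample.
    fitted = algorithm([data[i] for i in sorted(train_idx)])
    return sum(contrast(fitted, data[i]) for i in valid_idx) / len(valid_idx)


data = [1.0, 2.0, 3.0, 6.0]
estimate = hold_out(fit_mean, squared_loss, data, train_idx=[0, 1, 2])
```

Here the algorithm is trained on the first three points and evaluated on the last one, so the estimate is the squared distance between the training mean and the held-out point.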

4.2.2 General definition of cross-validation



A general description of the CV strategy has been given by Geisser (1975): In
brief, CV consists in averaging several hold-out estimators of the risk
corresponding to different splits of the data. Formally, let $B \geq 1$ be an
integer and $I_1^{(t)}, \ldots, I_B^{(t)}$ be a sequence of non-empty proper
subsets of $\{1, \ldots, n\}$. The CV estimator of the risk of $\mathcal{A}(D_n)$
with training sets $(I_j^{(t)})_{1 \leq j \leq B}$ is defined by

$$
\widehat{\mathcal{L}}^{\mathrm{CV}}\left(\mathcal{A}; D_n; (I_j^{(t)})_{1 \leq j \leq B}\right)
  := \frac{1}{B} \sum_{j=1}^{B}
     \widehat{\mathcal{L}}^{\mathrm{H\text{-}O}}\left(\mathcal{A}; D_n; I_j^{(t)}\right) .
\tag{9}
$$

All existing CV estimators of the risk are of the form (9), each one being
uniquely determined by the way the sequence $(I_j^{(t)})_{1 \leq j \leq B}$ is
chosen, that is, the choice of the splitting scheme.

Note that when CV is used in model selection for identification, an alternative
definition of CV was proposed by Yang (2006, 2007) and called CV with
voting (CV-v). When two algorithms $\mathcal{A}_1$ and $\mathcal{A}_2$ are
compared, $\mathcal{A}_1$ is selected by CV-v if and only if
$\widehat{\mathcal{L}}^{\mathrm{H\text{-}O}}(\mathcal{A}_1; D_n; I_j^{(t)}) <
\widehat{\mathcal{L}}^{\mathrm{H\text{-}O}}(\mathcal{A}_2; D_n; I_j^{(t)})$
for a majority of the splits $j = 1, \ldots, B$. By contrast, CV procedures of
the form (9) can be called "CV with averaging" (CV-a), since the estimates of
the risk of the algorithms are averaged before their comparison.
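
The difference between CV-a and CV-v can be illustrated by a minimal sketch. The two constant "algorithms" and the data below are hypothetical and chosen only so that the two rules disagree; they are not an example from the survey.

```python
def hold_out(fit, loss, data, train_idx):
    """Hold-out estimate: fit on train_idx, average the loss over the other indices."""
    train = set(train_idx)
    fitted = fit([data[i] for i in sorted(train)])
    valid = [i for i in range(len(data)) if i not in train]
    return sum(loss(fitted, data[i]) for i in valid) / len(valid)


def cv_averaging(fit, loss, data, splits):
    """CV-a: average of the hold-out estimates over all training sets in splits."""
    return sum(hold_out(fit, loss, data, s) for s in splits) / len(splits)


def cv_voting_prefers_first(fit1, fit2, loss, data, splits):
    """CV-v: True iff fit1 has the strictly smaller hold-out estimate on a majority of splits."""
    wins = sum(hold_out(fit1, loss, data, s) < hold_out(fit2, loss, data, s)
               for s in splits)
    return 2 * wins > len(splits)


def squared_loss(t, x):
    return (t - x) ** 2


def always_zero(sample):
    """Hypothetical algorithm ignoring its sample and predicting 0."""
    return 0.0


def always_four(sample):
    """Hypothetical algorithm ignoring its sample and predicting 4."""
    return 4.0


data = [0.0, 0.0, 0.0, 10.0]
loo_splits = [[i for i in range(4) if i != j] for j in range(4)]
risk_zero = cv_averaging(always_zero, squared_loss, data, loo_splits)
risk_four = cv_averaging(always_four, squared_loss, data, loo_splits)
zero_wins_vote = cv_voting_prefers_first(
    always_zero, always_four, squared_loss, data, loo_splits)
```

On this data the first predictor wins three of the four splits, so voting selects it, while averaging selects the second predictor because of the single large error on the outlying point.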



4.3 Classical examples 

Most classical CV estimators split the data with a fixed size $n_t$ of the
training set, that is, $\mathrm{Card}(I_j^{(t)}) \approx n_t$ for every $j$. The
question of choosing $n_t$ is discussed extensively in the rest of this survey.
In this subsection, several CV estimators are defined. Two main categories of
splitting schemes can be distinguished, given $n_t$: exhaustive data splitting,
that is considering all training sets $I^{(t)}$ of size $n_t$, and partial data
splitting.



4.3.1 Exhaustive data splitting 

Leave-one-out (LOO, Stone, 1974; Allen, 1974; Geisser, 1975) is the most
classical exhaustive CV procedure, corresponding to the choice $n_t = n - 1$:
Each data point is successively "left out" from the sample and used for validation.
Formally, LOO is defined by (9) with $B = n$ and $I_j^{(t)} = \{j\}^c$ for
$j = 1, \ldots, n$:

$$
\widehat{\mathcal{L}}^{\mathrm{LOO}}(\mathcal{A}; D_n)
  = \frac{1}{n} \sum_{j=1}^{n} \gamma\left(\mathcal{A}(D_n^{(-j)}); \xi_j\right)
\tag{10}
$$

where $D_n^{(-j)} = (\xi_i)_{i \neq j}$. The name LOO can be traced back to papers
by Picard and Cook (1984) and by Breiman and Spector (1992), but LOO has several
other names in the literature, such as delete-one CV (see Li, 1987), ordinary
CV (Stone, 1974; Burman, 1989), or even only CV (Efron, 1983; Li, 1987).

Leave-p-out (LPO, Shao, 1993) with $p \in \{1, \ldots, n\}$ is the exhaustive CV
with $n_t = n - p$: every possible set of $p$ data points is successively "left
out" from the sample and used for validation. Therefore, LPO is defined by (9)
with $B = \binom{n}{p}$, where $((I_j^{(t)})^c)_{1 \leq j \leq B}$ are all the
subsets of $\{1, \ldots, n\}$ of size $p$. LPO is also called delete-p CV or
delete-p multifold CV (Zhang, 1993). Note that LPO with $p = 1$ is LOO.
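
Exhaustive splitting can be sketched by enumerating all validation sets of size p; LOO is recovered with p = 1. The empirical-mean algorithm and the squared loss are assumptions made for illustration only.

```python
from itertools import combinations
from math import comb


def squared_loss(t, x):
    return (t - x) ** 2


def fit_mean(sample):
    """Toy algorithm (an assumption for illustration): the empirical mean."""
    return sum(sample) / len(sample)


def leave_p_out(fit, loss, data, p):
    """LPO estimate: average of the hold-out estimates over all validation sets of size p."""
    n = len(data)
    total = 0.0
    for valid in combinations(range(n), p):
        held_out = set(valid)
        fitted = fit([data[i] for i in range(n) if i not in held_out])
        total += sum(loss(fitted, data[i]) for i in valid) / p
    return total / comb(n, p)  # B = binomial(n, p) splits


data = [1.0, 2.0, 3.0, 6.0]
loo = leave_p_out(fit_mean, squared_loss, data, p=1)
lpo2 = leave_p_out(fit_mean, squared_loss, data, p=2)
```

The loop runs over binomial(n, p) splits, which is what makes LPO expensive as soon as n is moderately large.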



4.3.2 Partial data splitting 

Considering $\binom{n}{p}$ training sets can be computationally intractable, even
for small $p$, so that partial data splitting methods have been proposed.

V-fold CV (VFCV) with $V \in \{1, \ldots, n\}$ was introduced by Geisser (1975) as
an alternative to the computationally expensive LOO (see also Breiman et al.,
1984, for instance). VFCV relies on a preliminary partitioning of the data into
$V$ subsamples of approximately equal cardinality $n/V$; each of these subsamples
successively plays the role of validation sample. Formally, let $A_1, \ldots, A_V$
be some partition of $\{1, \ldots, n\}$ with $\mathrm{Card}(A_j) \approx n/V$.
Then, the VFCV estimator of the risk of $\mathcal{A}$ is defined by (9) with
$B = V$ and $I_j^{(t)} = A_j^c$ for $j = 1, \ldots, B$, that is,

$$
\widehat{\mathcal{L}}^{\mathrm{VF}}\left(\mathcal{A}; D_n; (A_j)_{1 \leq j \leq V}\right)
  = \frac{1}{V} \sum_{j=1}^{V} \left[ \frac{1}{\mathrm{Card}(A_j)}
    \sum_{i \in A_j} \gamma\left(\mathcal{A}(D_n^{(-A_j)}); \xi_i\right) \right]
\tag{11}
$$

where $D_n^{(-A_j)} = (\xi_i)_{i \in A_j^c}$. By construction, the algorithmic
complexity of VFCV is only $V$ times that of training $\mathcal{A}$ with
$n - n/V$ data points, which is much less than LOO or LPO if $V \ll n$. Note that
VFCV with $V = n$ is LOO.
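
A minimal sketch of the VFCV estimator (11) follows. The deterministic partition below is an assumption made to keep the example reproducible (in practice the partition is often drawn at random), and the empirical-mean algorithm is again only illustrative.

```python
def squared_loss(t, x):
    return (t - x) ** 2


def fit_mean(sample):
    """Toy algorithm (an assumption for illustration): the empirical mean."""
    return sum(sample) / len(sample)


def v_fold_partition(n, V):
    """Partition range(n) into V blocks whose sizes differ by at most one (no shuffling)."""
    return [list(range(j, n, V)) for j in range(V)]


def v_fold_cv(fit, loss, data, blocks):
    """VFCV estimate: each block is used once for validation, its complement for training."""
    total = 0.0
    for block in blocks:
        held_out = set(block)
        fitted = fit([x for i, x in enumerate(data) if i not in held_out])
        total += sum(loss(fitted, data[i]) for i in block) / len(block)
    return total / len(blocks)


data = [1.0, 2.0, 3.0, 6.0]
two_fold = v_fold_cv(fit_mean, squared_loss, data, v_fold_partition(len(data), 2))
n_fold = v_fold_cv(fit_mean, squared_loss, data, v_fold_partition(len(data), len(data)))
```

Only V trainings are needed, and taking V = n gives back the LOO estimate.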



Balanced Incomplete CV (BICV, Shao, 1993) can be seen as an alternative
to VFCV well-suited for small training sample sizes $n_t$. Indeed, BICV is defined
by (9) with training sets $(A^c)_{A \in \mathcal{T}}$, where $\mathcal{T}$ is a
balanced incomplete block design (BIBD, John, 1971), that is, a collection of
$B > 0$ subsets of $\{1, \ldots, n\}$ of size $n_v = n - n_t$ such that:

1. $\mathrm{Card}\{A \in \mathcal{T} \text{ s.t. } k \in A\}$ does not depend on
   $k \in \{1, \ldots, n\}$.

2. $\mathrm{Card}\{A \in \mathcal{T} \text{ s.t. } k, \ell \in A\}$ does not depend
   on $k \neq \ell \in \{1, \ldots, n\}$.

The idea of BICV is to give to each data point (and each pair of data points)
the same role in the training and validation tasks. Note that VFCV relies on a
similar idea, since the set of training sample indices used by VFCV satisfies the
first property and almost the second one: Pairs $(k, \ell)$ belonging to the same
$A_j$ appear in one validation set more than other pairs.
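
The two balance properties can be checked mechanically for any collection of validation sets. The sketch below uses 0-based indices, and the Fano plane is taken as a standard example of a BIBD; both choices are illustrative assumptions and do not come from the survey.

```python
from itertools import combinations


def design_properties(blocks, n):
    """Return (first, second): whether all points, resp. all pairs of points,
    of range(n) belong to the same number of blocks."""
    point_counts = {sum(k in block for block in blocks) for k in range(n)}
    pair_counts = {
        sum(k in block and l in block for block in blocks)
        for k, l in combinations(range(n), 2)
    }
    return len(point_counts) == 1, len(pair_counts) == 1


# Fano plane: 7 blocks of size 3 on 7 points, every pair lies in exactly one block.
fano = [{0, 1, 2}, {0, 3, 4}, {0, 5, 6}, {1, 3, 5}, {1, 4, 6}, {2, 3, 6}, {2, 4, 5}]
# Validation sets of a 3-fold partition of 6 points.
three_fold = [{0, 1}, {2, 3}, {4, 5}]

fano_balanced = design_properties(fano, 7)
fold_balanced = design_properties(three_fold, 6)
```

The V-fold partition passes the first check but fails the second, since two points of the same block share one validation set while two points of different blocks share none.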



Repeated learning-testing (RLT) was introduced by Breiman et al. (1984)
and further studied by Burman (1989) and by Zhang (1993) for instance. The
RLT estimator of the risk of $\mathcal{A}$ is defined by (9) with any $B > 0$,
where $(I_j^{(t)})_{1 \leq j \leq B}$ are $B$ different subsets of
$\{1, \ldots, n\}$, chosen randomly and independently from the data. RLT can be
seen as an approximation to LPO with $p = n - n_t$, with which it coincides when
$B = \binom{n}{p}$.

Monte-Carlo CV (MCCV, Picard and Cook, 1984) is very close to RLT: $B$
independent subsets of $\{1, \ldots, n\}$ are randomly drawn, with uniform
distribution among subsets of size $n_t$. The only difference with RLT is that
MCCV allows the same split to be chosen several times.
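
The distinction between the two random splitting schemes can be sketched as follows. Drawing the RLT splits by rejecting repeated subsets is an assumption of this sketch (the text only requires the B subsets to be different), and the function names are hypothetical.

```python
import random
from itertools import combinations
from math import comb


def mccv_training_sets(n, n_t, B, rng):
    """MCCV: B independent uniform draws among subsets of size n_t (repeats allowed)."""
    return [frozenset(rng.sample(range(n), n_t)) for _ in range(B)]


def rlt_training_sets(n, n_t, B, rng):
    """RLT: B different subsets of size n_t, obtained here by discarding repeated draws."""
    if B > comb(n, n_t):
        raise ValueError("B cannot exceed the number of subsets of size n_t")
    chosen = set()
    while len(chosen) < B:
        chosen.add(frozenset(rng.sample(range(n), n_t)))
    return list(chosen)


rng = random.Random(0)
mccv = mccv_training_sets(4, 2, 50, rng)          # only 6 distinct subsets exist
rlt = rlt_training_sets(5, 3, comb(5, 3), rng)    # all subsets: same splits as LPO
all_subsets = {frozenset(c) for c in combinations(range(5), 3)}
```

With B equal to the number of subsets of size n_t, the RLT splits are exactly those of LPO with p = n - n_t, whereas MCCV with a large B necessarily repeats some splits.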



4.3.3 Other cross-validation-like risk estimators 

Several procedures have been introduced which are close to, or based on CV. 
Most of them aim at fixing an observed drawback of CV. 



Bias-corrected versions of VFCV and RLT risk estimators have been proposed
by Burman (1989, 1990), and a closely related penalization procedure
called V-fold penalization has been defined by Arlot (2008c); see Section 5.1.2
for details.



Generalized CV (GCV, Craven and Wahba, 1979) was introduced as a
rotation-invariant version of LOO in least-squares regression, for estimating the
risk of a linear estimator $\widehat{s} = MY$ where
$Y = (Y_i)_{1 \leq i \leq n} \in \mathbb{R}^n$ and $M$ is an $n \times n$ matrix
independent from $Y$:

$$
\mathrm{crit}_{\mathrm{GCV}}(M, Y)
  := \frac{n^{-1} \left\| Y - MY \right\|^2}{\left( 1 - n^{-1} \mathrm{tr}(M) \right)^2}
\quad \text{where} \quad
\forall t \in \mathbb{R}^n , \ \left\| t \right\|^2 = \sum_{i=1}^{n} t_i^2 .
$$

GCV is actually closer to $C_L$ (Mallows, 1973) than to CV, since GCV can be
seen as an approximation to $C_L$ with a particular estimator of the variance
(Efron, 1986). The efficiency of GCV has been proved in various frameworks,
in particular by Li (1985, 1987) and by Cao and Golubev (2006).
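
As a minimal sketch, the GCV criterion can be evaluated for any smoothing matrix M. The example below takes M to be the projection onto constant vectors (so that MY is the vector of sample means), which is an illustrative assumption rather than a case treated in the survey.

```python
def gcv_criterion(M, Y):
    """GCV criterion for the linear estimator MY, with M given as a list of rows."""
    n = len(Y)
    fitted = [sum(M[i][k] * Y[k] for k in range(n)) for i in range(n)]
    residual_norm_sq = sum((Y[i] - fitted[i]) ** 2 for i in range(n))
    trace = sum(M[i][i] for i in range(n))
    return (residual_norm_sq / n) / (1.0 - trace / n) ** 2


Y = [1.0, 2.0, 3.0, 6.0]
n = len(Y)
# Projection onto constant vectors: every fitted value is the sample mean, tr(M) = 1.
M = [[1.0 / n] * n for _ in range(n)]
gcv = gcv_criterion(M, Y)
```

For this particular M all diagonal entries are equal, and the GCV value coincides with the LOO estimate of the risk of the sample mean under the squared loss (56/9 on this data).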



Analytic Approximation When CV is used for selecting among linear models,
Shao (1993) proposed an analytic approximation to LPO with $p \sim n$, which
is called APCV.



LOO bootstrap and .632 bootstrap The bootstrap is often used for stabilizing
an estimator or an algorithm, replacing $\mathcal{A}(D_n)$ by the average of
$\mathcal{A}(D_n^{\star})$ over several bootstrap resamples $D_n^{\star}$. This
idea was applied by Efron (1983) to the LOO estimator of the risk, leading to the
LOO bootstrap. Noting that the LOO bootstrap was biased, Efron (1983) gave a
heuristic argument leading to the .632 bootstrap estimator of the risk, later
modified into the .632+ bootstrap by Efron and Tibshirani (1997). The main
drawback of these procedures is the weakness of their theoretical
justifications. Only empirical studies have supported the good behaviour of
.632+ bootstrap (Efron and Tibshirani, 1997; Molinaro et al., 2005).



4.4 Historical remarks 

Simple validation or hold-out was the first CV-like procedure. It was introduced
in the psychology area (Larson, 1931) from the need for a reliable alternative
to the resubstitution error, as illustrated by Anderson et al. (1972). The
hold-out was used by Herzberg (1969) for assessing the quality of predictors. The
problem of choosing the training set was first considered by Stone (1974), where
"controllable" and "uncontrollable" data splits were distinguished; an instance
of uncontrollable division can be found in the book by [...]. A primitive LOO
procedure was used by [...] and by Lachenbruch and Mickey (1968) for evaluating
the error rate of a prediction rule, and a primitive formulation of LOO can be
found in a paper by Mosteller and Tukey (1968). Nevertheless, LOO was actually
introduced independently by Stone (1974), by Allen (1974) and by Geisser (1975).
The relationship between LOO and the jackknife (Quenouille, 1949), which both
rely on the idea of removing one observation from the sample, has been discussed
by Stone (1974) for instance.

The hold-out and CV were originally used only for estimating the risk of an
algorithm. The idea of using CV for model selection arose in the discussion of
a paper by Efron and Morris (1973) and in a paper by Geisser (1974). The first
author to study LOO as a model selection procedure was Stone (1974), who
proposed to use LOO again for estimating the risk of the selected model.



5 Statistical properties of cross-validation estimators of the risk

Understanding the behaviour of CV for model selection, which is the purpose
of this survey, first requires analyzing the performance of CV as an estimator
of the risk of a single algorithm. Two main properties of CV estimators of the
risk are of particular interest: their bias, and their variance.



5.1 Bias 

The bias incurred by CV estimates can be dealt with by two strategies:
evaluating the amount of bias in order to choose the least biased CV procedure,
or correcting for this bias.



5.1.1 Theoretical assessment of the bias 



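The independence of the training and the validation samples implies that for
every algorithm $\mathcal{A}$ and any $I^{(t)} \subset \{1, \ldots, n\}$ with
cardinality $n_t$,

$$
\mathbb{E}\left[ \widehat{\mathcal{L}}^{\mathrm{H\text{-}O}}\left(\mathcal{A}; D_n; I^{(t)}\right) \right]
  = \mathbb{E}\left[ \gamma\left(\mathcal{A}(D_n^{(t)}); \xi\right) \right]
  = \mathbb{E}\left[ \mathcal{L}_P\left(\mathcal{A}(D_{n_t})\right) \right] .
$$

Therefore, assuming that $\mathrm{Card}(I_j^{(t)}) = n_t$ for $j = 1, \ldots, B$,
the expectation of the CV estimator of the risk only depends on $n_t$:

$$
\mathbb{E}\left[ \widehat{\mathcal{L}}^{\mathrm{CV}}\left(\mathcal{A}; D_n; (I_j^{(t)})_{1 \leq j \leq B}\right) \right]
  = \mathbb{E}\left[ \mathcal{L}_P\left(\mathcal{A}(D_{n_t})\right) \right] .
\tag{12}
$$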



In particular, (12) shows that the bias of the CV estimator of the risk of
$\mathcal{A}$ is the difference between the risks of $\mathcal{A}$, computed
respectively with $n_t$ and $n$ data points. Since $n_t < n$, the bias of CV is
usually nonnegative, which can be proved rigorously when the risk of
$\mathcal{A}$ is a decreasing function of $n$, that is, when $\mathcal{A}$ is a
smart rule; note however that a classical algorithm such as 1-nearest-neighbour
in classification is not smart (Devroye et al., 1996, Section 6.8). Similarly, the
bias of CV tends to decrease with $n_t$, which is rigorously true if $\mathcal{A}$
is smart.
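
As an illustrative worked example (not taken from the survey), assume that $\mathcal{A}(D_m)$ is the empirical mean $\bar{\xi}_m$ of $m$ i.i.d. real-valued observations with variance $\sigma^2$, and that $\gamma(t; \xi) = (t - \xi)^2$. Under these assumptions the risk has a closed form, so the bias described by (12) can be written explicitly:

```latex
\begin{align*}
\mathbb{E}\bigl[\mathcal{L}_P(\mathcal{A}(D_m))\bigr]
  &= \mathbb{E}\bigl[(\bar{\xi}_m - \xi)^2\bigr]
   = \sigma^2 + \frac{\sigma^2}{m} , \\
\mathbb{E}\bigl[\mathcal{L}_P(\mathcal{A}(D_{n_t}))\bigr]
  - \mathbb{E}\bigl[\mathcal{L}_P(\mathcal{A}(D_n))\bigr]
  &= \sigma^2 \Bigl( \frac{1}{n_t} - \frac{1}{n} \Bigr) \;\geq\; 0 , \\
\text{and for } n_t = \frac{n (V-1)}{V} : \quad
\sigma^2 \Bigl( \frac{V}{n (V-1)} - \frac{1}{n} \Bigr)
  &= \frac{\sigma^2}{n (V-1)} .
\end{align*}
```

In this toy setting the bias is nonnegative and decreases with $n_t$; with V equal blocks it is $\sigma^2 / (n(V-1))$, which is smallest for $V = n$ (LOO).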

More precisely, (12) has led to several results on the bias of CV, which can be
split into three main categories: asymptotic results ($\mathcal{A}$ is fixed and
the sample size $n$ tends to infinity), non-asymptotic results (where
$\mathcal{A}$ is allowed to make use of a number of parameters growing with $n$,
say $n^{1/2}$, as often in model selection), and empirical results. They are
listed below by statistical framework.



Regression The general behaviour of the bias of CV (positive, decreasing
with $n_t$) is confirmed by several papers and for several CV estimators. For
LPO, non-asymptotic expressions of its bias were proved by Celisse (2008b) for
projection estimators, and by Arlot and Celisse (2009) for regressograms and
kernel estimators when the design is fixed. For VFCV and RLT, an asymptotic
expansion of their bias was derived by Burman (1989) for least-squares
estimators in linear regression, and extended to spline smoothing (Burman, 1990).
Note finally that Efron (1986) proved non-asymptotic analytic expressions of
the expectations of the LOO and GCV estimators of the risk in regression with
binary data (see also Efron, 1983, for some explicit calculations).



Density estimation shows a similar picture. Non-asymptotic expressions
for the bias of LPO estimators for kernel and projection estimators with the
quadratic risk were proved by Celisse and Robin (2008) and by Celisse (2008a).
Asymptotic expansions of the bias of the LOO estimator for histograms and
kernel estimators were previously proved by Rudemo (1982); see Bowman (1984) for
simulations. Hall (1987) derived similar results with the log-likelihood contrast
for kernel estimators, and related the performance of LOO to the interaction
between the kernel and the tails of the target density $s$.



Classification For the simple problem of discriminating between two
populations with shifted distributions, Davison and Hall (1992) compared the
asymptotic bias of LOO and bootstrap, showing the superiority of the LOO when
the shift size is $n^{-1/2}$: As $n$ tends to infinity, the bias of LOO stays of
order $n^{-1}$, whereas that of bootstrap worsens to the order $n^{-1/2}$. On
realistic synthetic and real biological data, Molinaro et al. (2005) compared the
bias of LOO, VFCV and .632+ bootstrap: The bias decreases with $n_t$, and is
generally minimal for LOO. Nevertheless, the 10-fold CV bias is nearly minimal
uniformly over their experiments. In the same experiments, .632+ bootstrap
exhibits the smallest bias for moderate sample sizes and small signal-to-noise
ratios, but a much larger bias otherwise.



CV-calibrated algorithms When a family of algorithms
$(\mathcal{A}_\lambda)_{\lambda \in \Lambda}$ is given, and $\widehat{\lambda}$ is
chosen by minimizing $\widehat{\mathcal{L}}^{\mathrm{CV}}(\mathcal{A}_\lambda; D_n)$
over $\lambda$, $\widehat{\mathcal{L}}^{\mathrm{CV}}(\mathcal{A}_{\widehat{\lambda}}; D_n)$
is biased for estimating the risk of $\mathcal{A}_{\widehat{\lambda}}(D_n)$, as
reported from simulation experiments by Stone (1974) for the LOO, and by
Jonathan et al. (2000) for VFCV in the variable selection setting. This bias is
of a different nature compared to the previous frameworks. Indeed,
$\widehat{\mathcal{L}}^{\mathrm{CV}}(\mathcal{A}_{\widehat{\lambda}}; D_n)$ is
biased simply because $\widehat{\lambda}$ was chosen using the same data as
$\widehat{\mathcal{L}}^{\mathrm{CV}}(\mathcal{A}_\lambda; D_n)$. This phenomenon is
similar to the optimism of $\mathcal{L}_{P_n}(\widehat{s}(D_n))$ as an estimator of
the loss of $\widehat{s}(D_n)$. The correct way of estimating the risk of
$\mathcal{A}_{\widehat{\lambda}}(D_n)$ with CV is to consider the full algorithm
$\mathcal{A}' : D_n \mapsto \mathcal{A}_{\widehat{\lambda}(D_n)}(D_n)$, and then to
compute $\widehat{\mathcal{L}}^{\mathrm{CV}}(\mathcal{A}'; D_n)$. The resulting
procedure is called "double cross" by Stone (1974).
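
The "double cross" idea can be sketched by wrapping the selection step inside the algorithm that is cross-validated. The family of shrunken means, the use of LOO at both levels and all names below are assumptions made for illustration.

```python
def squared_loss(t, x):
    return (t - x) ** 2


def loo_estimate(fit, loss, data):
    """LOO estimate of the risk of fit(data); data is a list."""
    n = len(data)
    return sum(loss(fit(data[:j] + data[j + 1:]), data[j]) for j in range(n)) / n


def make_shrunken_mean(lam):
    """Hypothetical family A_lambda: the empirical mean multiplied by lam."""
    def fit(sample):
        return lam * sum(sample) / len(sample)
    return fit


def full_algorithm(family, loss):
    """A': select lambda by LOO on the sample, then train A_lambda on that same sample."""
    def fit(sample):
        best = min(family, key=lambda candidate: loo_estimate(candidate, loss, sample))
        return best(sample)
    return fit


family = [make_shrunken_mean(lam) for lam in (0.0, 0.5, 1.0)]
data = [1.0, 2.0, 3.0, 6.0, 4.0]

# Minimal value of the CV criterion: biased as an estimate of the risk of the selected rule.
naive = min(loo_estimate(fit, squared_loss, data) for fit in family)
# Double cross: the selection step is repeated inside every outer training sample.
double_cross = loo_estimate(full_algorithm(family, squared_loss), squared_loss, data)
```

The inner selection is rerun on each outer training sample, so each held-out point is never used to choose the parameter that is then evaluated on it.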

5.1.2 Correction of the bias 

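An alternative to choosing the CV estimator with the smallest bias is to correct
for the bias of the CV estimator of the risk. Burman (1989, 1990) proposed a
corrected VFCV estimator, defined by

$$
\widehat{\mathcal{L}}^{\mathrm{corrVF}}(\mathcal{A}; D_n)
  = \widehat{\mathcal{L}}^{\mathrm{VF}}(\mathcal{A}; D_n)
  + \mathcal{L}_{P_n}\left(\mathcal{A}(D_n)\right)
  - \frac{1}{V} \sum_{j=1}^{V} \mathcal{L}_{P_n}\left(\mathcal{A}(D_n^{(-A_j)})\right) ,
$$

and a corrected RLT estimator was defined similarly. Both estimators have
been proved to be asymptotically unbiased for least-squares estimators in linear
regression.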

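When the $A_j$s have exactly the same size $n/V$, the corrected VFCV criterion
is equal to the sum of the empirical risk and the V-fold penalty (Arlot, 2008a),
defined by

$$
\mathrm{pen}_{\mathrm{VF}}(\mathcal{A}; D_n)
  = \frac{V-1}{V} \sum_{j=1}^{V} \left[
      \mathcal{L}_{P_n}\left(\mathcal{A}(D_n^{(-A_j)})\right)
    - \mathcal{L}_{P_n^{(-A_j)}}\left(\mathcal{A}(D_n^{(-A_j)})\right) \right] ,
$$

where $\mathcal{L}_{P_n^{(-A_j)}}$ denotes the empirical risk computed on the
training sample $D_n^{(-A_j)}$. The V-fold penalized criterion was proved to be
(almost) unbiased in the non-asymptotic framework for regressogram estimators.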
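
The stated equality between the corrected VFCV criterion and the penalized empirical risk can be checked numerically on a toy example. The sketch below assumes the usual form of Burman's correction (VFCV estimate, plus the empirical risk of the full fit, minus the average empirical risk of the V partial fits) and the V-fold penalty with factor (V-1)/V; the data, the empirical-mean algorithm and the function names are illustrative.

```python
def squared_loss(t, x):
    return (t - x) ** 2


def fit_mean(sample):
    """Toy algorithm (an assumption for illustration): the empirical mean."""
    return sum(sample) / len(sample)


def empirical_risk(predictor, loss, sample):
    return sum(loss(predictor, x) for x in sample) / len(sample)


def corrected_and_penalized(fit, loss, data, blocks):
    """Return (corrected VFCV criterion, empirical risk + V-fold penalty)."""
    V = len(blocks)
    full_risk = empirical_risk(fit(data), loss, data)
    vfcv = correction = penalty = 0.0
    for block in blocks:
        held_out = set(block)
        train = [x for i, x in enumerate(data) if i not in held_out]
        valid = [data[i] for i in block]
        fitted = fit(train)
        risk_on_all = empirical_risk(fitted, loss, data)
        vfcv += empirical_risk(fitted, loss, valid) / V
        correction += risk_on_all / V
        penalty += (V - 1) / V * (risk_on_all - empirical_risk(fitted, loss, train))
    return vfcv + full_risk - correction, full_risk + penalty


data = [1.0, 2.0, 3.0, 6.0, 4.0, 8.0]
blocks = [[0, 1], [2, 3], [4, 5]]  # three blocks of exactly the same size
corrected, penalized = corrected_and_penalized(fit_mean, squared_loss, data, blocks)
```

The two returned values agree up to rounding when all blocks have the same size; with unequal blocks they differ in general.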

5.2 Variance 

CV estimators of the risk using training sets of the same size $n_t$ have
the same bias, but they still behave quite differently; their variance
$\operatorname{var}(\widehat{\mathcal{L}}^{\mathrm{CV}}(\mathcal{A}; D_n; (I_j^{(t)})_{1 \leq j \leq B}))$
captures most of the information to explain these differences.



5.2.1 Variability factors 



Assume that $\mathrm{Card}(I_j^{(t)}) = n_t$ for every $j$. The variance of CV
results from the combination of several factors, in particular $(n_t, n_v)$ and
$B$.

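Influence of $(n_t, n_v)$ Let us consider the hold-out estimator of the risk.
Following in particular Nadeau and Bengio (2003),

$$
\begin{aligned}
\operatorname{var}\left[ \widehat{\mathcal{L}}^{\mathrm{H\text{-}O}}\left(\mathcal{A}; D_n; I^{(t)}\right) \right]
  &= \mathbb{E}\left[ \operatorname{var}\left(
       \mathcal{L}_{P_n^{(v)}}\left(\mathcal{A}(D_n^{(t)})\right) \,\middle|\, D_n^{(t)} \right) \right]
   + \operatorname{var}\left[ \mathcal{L}_P\left(\mathcal{A}(D_{n_t})\right) \right] \\
  &= \frac{1}{n_v} \mathbb{E}\left[ \operatorname{var}\left(
       \gamma(\widehat{s}, \xi) \,\middle|\, \widehat{s} = \mathcal{A}(D_n^{(t)}) \right) \right]
   + \operatorname{var}\left[ \mathcal{L}_P\left(\mathcal{A}(D_{n_t})\right) \right] .
\end{aligned}
\tag{13}
$$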



The first term, proportional to $1/n_v$, shows that more data for validation
decreases the variance of $\widehat{\mathcal{L}}^{\mathrm{H\text{-}O}}$, because it
yields a better estimator of $\mathcal{L}_P(\mathcal{A}(D_n^{(t)}))$. The second
term shows that the variance of $\widehat{\mathcal{L}}^{\mathrm{H\text{-}O}}$ also
depends on the distribution of $\mathcal{L}_P(\mathcal{A}(D_n^{(t)}))$ around its
expectation; in particular, it strongly depends on the stability of $\mathcal{A}$.
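
To make the two terms of (13) concrete, here is an illustrative worked example (not taken from the survey). Assume Gaussian data $\xi \sim \mathcal{N}(\mu, \sigma^2)$, the empirical mean as algorithm, so that $\widehat{s} = \mathcal{A}(D_n^{(t)}) \sim \mathcal{N}(\mu, \sigma^2 / n_t)$, and the contrast $\gamma(t; \xi) = (t - \xi)^2$. Using $\operatorname{var}(Z^2) = 2\tau^4 + 4 m^2 \tau^2$ for $Z \sim \mathcal{N}(m, \tau^2)$:

```latex
\begin{align*}
\operatorname{var}\bigl( \gamma(\widehat{s}, \xi) \mid \widehat{s} \bigr)
  &= 2\sigma^4 + 4\sigma^2 (\widehat{s} - \mu)^2 ,
  &\mathbb{E}\bigl[ \operatorname{var}( \gamma(\widehat{s}, \xi) \mid \widehat{s} ) \bigr]
  &= 2\sigma^4 + \frac{4\sigma^4}{n_t} , \\
\mathcal{L}_P(\widehat{s}) &= \sigma^2 + (\widehat{s} - \mu)^2 ,
  &\operatorname{var}\bigl[ \mathcal{L}_P(\widehat{s}) \bigr]
  &= \frac{2\sigma^4}{n_t^2} ,
\end{align*}
\[
\operatorname{var}\bigl[ \widehat{\mathcal{L}}^{\mathrm{H\text{-}O}} \bigr]
  = \frac{2\sigma^4}{n_v} \Bigl( 1 + \frac{2}{n_t} \Bigr) + \frac{2\sigma^4}{n_t^2} .
\]
```

With $n = n_t + n_v$ fixed, enlarging the validation sample shrinks the first term but inflates the second one, which is the trade-off described above.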

Stability and variance When $\mathcal{A}$ is unstable,
$\widehat{\mathcal{L}}^{\mathrm{LOO}}(\mathcal{A})$ has often been pointed out as a
variable estimator (Section 7.10, Hastie et al., 2001; Breiman, 1996).
Conversely, this trend disappears when $\mathcal{A}$ is stable, as noticed by
Molinaro et al. (2005) from a simulation experiment.

The relation between the stability of $\mathcal{A}$ and the variance of
$\widehat{\mathcal{L}}^{\mathrm{CV}}(\mathcal{A})$ was pointed out by Devroye and
Wagner (1979) in classification, through upper bounds on the variance of
$\widehat{\mathcal{L}}^{\mathrm{LOO}}(\mathcal{A})$. Bousquet and Elisseeff (2002)
extended these results to the regression setting, and proved upper bounds on the
maximal upward deviation of $\widehat{\mathcal{L}}^{\mathrm{LOO}}(\mathcal{A})$.

Note finally that several approaches based on the bootstrap have been proposed
for reducing the variance of $\widehat{\mathcal{L}}^{\mathrm{LOO}}(\mathcal{A})$,
such as LOO bootstrap, .632 bootstrap and .632+ bootstrap (Efron, 1983); see also
Section 4.3.3.



Partial splitting and variance When $(n_t, n_v)$ is fixed, the variability of
CV tends to be larger for partial data splitting methods than for LPO. Indeed,
having to choose $B < \binom{n}{n_t}$ subsets $(I_j^{(t)})_{1 \leq j \leq B}$ of
$\{1, \ldots, n\}$, usually randomly, induces an additional variability compared
to $\widehat{\mathcal{L}}^{\mathrm{LPO}}$ with $p = n - n_t$. In the case of MCCV,
this variability decreases like $B^{-1}$ since the $I_j^{(t)}$ are chosen
independently. The dependence on $B$ is slightly different for other CV estimators
such as RLT or VFCV, because the $I_j^{(t)}$ are not independent. In particular, it
is maximal for the hold-out, and minimal (null) for LOO (if $n_t = n - 1$) and
LPO (with $p = n - n_t$).

Note that the dependence on $V$ for VFCV is more complex to evaluate, since
$B$, $n_t$, and $n_v$ simultaneously vary with $V$. Nevertheless, a non-asymptotic
theoretical quantification of this additional variability of VFCV has been
obtained by Celisse and Robin (2008) in the density estimation framework (see
also empirical considerations by Jonathan et al., 2000).



5.2.2 Theoretical assessment of the variance 

Understanding precisely how
$\operatorname{var}(\widehat{\mathcal{L}}^{\mathrm{CV}}(\mathcal{A}))$ depends on
the splitting scheme is complex in general, since $n_t$ and $n_v$ have a fixed sum
$n$, and the number of splits $B$ is generally linked with $n_t$ (for instance,
for LPO and VFCV). Furthermore, the variance of CV behaves quite differently in
different frameworks, depending in particular on the stability of $\mathcal{A}$.
The consequence is that contradictory results have been obtained in different
frameworks, in particular on the value of $V$ for which the VFCV estimator of the
risk has a minimal variance (Burman, 1989; Hastie et al., 2001, Section 7.10).
Despite the difficulty of the problem, the variance of several CV estimators of
the risk has been assessed in several frameworks, as detailed below.



Regression In the linear regression setting, Burman (1989) derived asymptotic
expansions of the variance of the VFCV and RLT estimators of the risk
with homoscedastic data. The variance of RLT decreases with $B$, and in the
case of VFCV, in a particular setting,

$$
\operatorname{var}\left( \widehat{\mathcal{L}}^{\mathrm{VF}}(\mathcal{A}) \right)
  = \frac{2\sigma^2}{n} + \frac{4\sigma^4}{n^2}
    \left[ 4 + \frac{4}{V-1} + \frac{2}{(V-1)^2} + \frac{1}{(V-1)^3} \right]
  + o\left(n^{-2}\right) .
$$

The asymptotic variance of the VFCV estimator of the risk decreases with $V$,
implying that LOO asymptotically has the minimal variance.

Non-asymptotic closed-form formulas of the variance of the LPO estimator
of the risk have been proved by Celisse (2008b) in regression, for projection and
kernel estimators for instance. On the variance of RLT in the regression setting,
see the asymptotic results of Girard (1998) for Nadaraya-Watson kernel
estimators, as well as the non-asymptotic computations and simulation experiments
by Nadeau and Bengio (2003) with several learning algorithms.



Density estimation Non-asymptotic closed-form formulas of the variance of
the LPO estimator of the risk have been proved by Celisse and Robin (2008)
and by Celisse (2008a) for projection and kernel estimators. In particular, the
dependence of the variance of $\widehat{\mathcal{L}}^{\mathrm{LPO}}$ on $p$ has been
quantified explicitly for histogram and kernel estimators by Celisse and Robin
(2008).



Classification For the simple problem of discriminating between two
populations with shifted distributions, Davison and Hall (1992) showed that the
gap between asymptotic variances of LOO and bootstrap becomes larger when data
are noisier. Nadeau and Bengio (2003) made non-asymptotic computations and
simulation experiments with several learning algorithms. Hastie et al. (2001)
empirically showed that VFCV has a minimal variance for some $2 < V < n$,
whereas LOO usually has a large variance; this fact certainly depends on the
stability of the algorithm considered, as shown by simulation experiments by
Molinaro et al. (2005).



5.2.3 Estimation of the variance 

There is no universal (valid under all distributions) unbiased estimator
of the variance of RLT (Nadeau and Bengio, 2003) and VFCV estimators
(Bengio and Grandvalet, 2004). In particular, Bengio and Grandvalet (2004)
recommend the use of variance estimators taking into account the correlation
structure between test errors; otherwise, the variance of CV can be strongly
underestimated.

Despite these negative results, (biased) estimators of the variance of
$\widehat{\mathcal{L}}^{\mathrm{CV}}$ have been proposed by Nadeau and Bengio
(2003), by Bengio and Grandvalet (2004) and by Markatou et al. (2005), and tested
in simulation experiments in regression and classification. Furthermore, in the
framework of density estimation with histograms, Celisse and Robin (2008)
proposed an estimator of the variance of the LPO risk estimator. Its accuracy is
assessed by a concentration inequality. These results have recently been extended
to projection estimators by Celisse (2008a).



6 Cross-validation for efficient model selection 



This section tackles the properties of CV procedures for model selection when
the goal is estimation (see Section 2.2).

6.1 Relationship between risk estimation and model selection

As shown in Section 3.1, minimizing an unbiased estimator of the risk leads to an
efficient model selection procedure. One could conclude here that the best CV
procedure for estimation is the one with the smallest bias and variance (at least
asymptotically), for instance, LOO in the least-squares regression framework
(Burman, 1989).

Nevertheless, the best CV estimator of the risk is not necessarily the best
model selection procedure. For instance, Breiman and Spector (1992) observed
that uniformly over the models, the best risk estimator is LOO, whereas 10-fold
CV is more accurate for model selection. Three main reasons for such a
difference can be invoked. First, the asymptotic framework ($\mathcal{A}$ fixed,
$n \to \infty$) may not apply to models close to the oracle, which typically has
a dimension growing with $n$ when $s$ does not belong to any model. Second, as
explained in Section 3.2, estimating the risk of each model with some bias can be
beneficial and compensate for the effect of a large variance, in particular when
the signal-to-noise ratio is small. Third, for model selection, what matters is
not that every estimate of the risk has small bias and variance, but more that

$$
\operatorname{sign}\left( \mathrm{crit}(m_1) - \mathrm{crit}(m_2) \right)
  = \operatorname{sign}\left( \mathcal{L}_P(\widehat{s}_{m_1}) - \mathcal{L}_P(\widehat{s}_{m_2}) \right)
$$

with the largest probability for models $m_1, m_2$ near the oracle.

Therefore, specific studies are required to evaluate the performances of the 
various CV procedures in terms of model selection efficiency. In most frame- 
works, the model selection performance directly follows from the properties of 
CV as an estimator of the risk, but not always. 

6.2 The global picture 

Let us start with the classification of model selection procedures made by Shao
(1997) in the linear regression framework, since it gives a good idea of the
performance of CV procedures for model selection in general. Typically, the
efficiency of CV only depends on the asymptotics of $n_t / n$:

• When $n_t \sim n$, CV is asymptotically equivalent to Mallows' $C_p$, hence
asymptotically optimal.

• When $n_t \sim \lambda n$ with $\lambda \in (0, 1)$, CV is asymptotically
equivalent to $\mathrm{GIC}_\kappa$ with $\kappa = 1 + \lambda^{-1}$, which is
defined as AIC with a penalty multiplied by $\kappa / 2$. Hence, such CV
procedures are overpenalizing by a factor $(1 + \lambda) / (2 \lambda) > 1$.



The above results have been proved by Shao (1997) for LPO (see also Li, 1987,
for the LOO); they also hold for RLT when $B \gg n^2$ since RLT is then equivalent
to LPO (Zhang, 1993).
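
The overpenalization factor is easy to tabulate for a given training fraction. The sketch below evaluates it at the fractions λ = (V-1)/V that V-fold splits would produce; this choice of λ is purely illustrative and does not assert that the result above extends to VFCV.

```python
from fractions import Fraction


def overpenalization_factor(lam):
    """kappa / 2 with kappa = 1 + 1/lam, that is (1 + lam) / (2 * lam)."""
    lam = Fraction(lam)
    if not 0 < lam < 1:
        raise ValueError("lam must lie in (0, 1)")
    kappa = 1 + 1 / lam
    return kappa / 2


# Training fractions (V - 1) / V for V = 2, 5, 10 (illustrative values).
factors = {V: overpenalization_factor(Fraction(V - 1, V)) for V in (2, 5, 10)}
```

The factor equals 3/2 when half of the data is used for training and approaches 1 as the training fraction approaches 1, in line with the asymptotic optimality obtained when the training sample size is equivalent to n.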



In a general statistical framework, the model selection performance
of MCCV, VFCV, LOO, LOO bootstrap, and .632 bootstrap for selection among
minimum contrast estimators was studied in a series of papers (van der Laan and
Dudoit, 2003; van der Laan et al., 2004, 2006; van der Vaart et al., 2006): these
results apply in particular to least-squares regression and density estimation.
It turns out that under mild conditions, an oracle-type inequality is proved,
showing that up to a multiplying factor $C_n \to 1$, the risk of CV is smaller
than the minimum of the risks of the models with a sample size $n_t$. In
particular, in most frameworks, this implies the asymptotic optimality of CV as
soon as $n_t \sim n$. When $n_t \sim \lambda n$ with $\lambda \in (0, 1)$, this
naturally generalizes Shao's results.

6.3 Results in various frameworks 

This section gathers results about model selection performances of CV when 
the goal is estimation, in various frameworks. Note that model selection is con- 
sidered here with a general meaning, including in particular bandwidth choice 
for kernel estimators. 



Regression First, the results of Section 6.2 suggest that CV is suboptimal
when $n_t$ is not asymptotically equivalent to $n$. This fact has been proved
rigorously for VFCV when $V = O(1)$ with regressograms (Arlot, 2008c): with large
probability, the risk of the model selected by VFCV is larger than
$1 + \kappa(V)$ times the risk of the oracle, with $\kappa(V) > 0$ for every fixed
$V$. Note however that the best $V$ for VFCV is not the largest one in every
regression framework, as shown empirically in linear regression (Breiman and
Spector, 1992; Herzberg and Tsukanov, 1986); Breiman (1996) proposed to explain
this phenomenon by relating the stability of the candidate algorithms and the
model selection performance of LOO in various regression frameworks.

Second, the "universality" of CV has been confirmed by showing that it
naturally adapts to heteroscedasticity of data when selecting among
regressograms. Despite its suboptimality, VFCV with $V = O(1)$ satisfies a
non-asymptotic oracle inequality with constant $C > 1$ (Arlot, 2008a).
Furthermore, V-fold penalization (which often coincides with corrected VFCV, see
Section 5.1.2) satisfies a non-asymptotic oracle inequality with $C_n \to 1$ as
$n \to +\infty$, both when $V = O(1)$ (Arlot, 2008c) and when $V = n$
(Arlot, 2008a). Note that n-fold penalization is very close to LOO, suggesting
that it is also asymptotically optimal with heteroscedastic data. Simulation
experiments in the context of change-point detection confirmed that CV adapts
well to heteroscedasticity, contrary to usual model selection procedures in the
same framework (Arlot and Celisse, 2009).

The performances of CV have also been assessed for other kinds of estimators
in regression. For choosing the number of knots in spline smoothing, Burman
(1990) proved that corrected versions of VFCV and RLT are asymptotically
optimal provided $n / (B n_v) = O(1)$. Furthermore, several CV methods have been
compared to GCV in kernel regression by Härdle et al. (1988) and by Girard
(1998): the conclusion is that GCV and related criteria are computationally more
efficient than MCCV or RLT, for a similar statistical performance.

Finally, note that asymptotic results about CV in regression have been
proved by Györfi et al. (2002), and an oracle inequality with constant $C > 1$ has
been proved by Wegkamp (2003) for the hold-out, with least-squares estimators.



Density estimation CV performs similarly to the regression case for selecting
among least-squares estimators (van der Laan et al., 2004): It yields a risk
smaller than the minimum of the risk with a sample size $n_t$. In particular,
non-asymptotic oracle inequalities with constant $C > 1$ have been proved by
Celisse (2008b) for the LPO when $p/n \in [a, b]$, for some $0 < a < b < 1$.



The performance of CV for selecting the bandwidth of kernel density
estimators has been studied in several papers. With the least-squares contrast,
the efficiency of LOO was proved by Hall (1983) and generalized to the
multivariate framework by Stone (1984); an oracle inequality asymptotically
leading to efficiency was recently proved by Dalelane (2005). With the
Kullback-Leibler divergence, CV can suffer from troubles in performing model
selection (see also Schuster and Gregory, 1981; Chow et al., 1987). The influence
of the tails of the target density $s$ was studied by Hall (1987), who gave
conditions under which CV is efficient and the chosen bandwidth is optimal at
first order.



Classification In the framework of binary classification by intervals (that is,
with $\mathcal{X} = [0, 1]$ and piecewise constant classifiers), Kearns et al.
(1997) proved an oracle inequality for the hold-out. Furthermore, empirical
experiments show that CV yields (almost) always the best performance, compared to
deterministic penalties (Kearns et al., 1997). On the contrary, simulation
experiments by Bartlett et al. (2002) in the same setting showed that random
penalties such as Rademacher complexity and maximal discrepancy usually perform
much better than hold-out, which is shown to be more variable.

Nevertheless, the hold-out still enjoys quite good theoretical properties: It
was proved to adapt to the margin condition by Blanchard and Massart (2006),
a property nearly unachievable with usual model selection procedures (see also
Massart, 2007, Section 8.5). This suggests that CV procedures are naturally
adaptive to several unknown properties of data in the statistical learning
framework.

The performance of the LOO in binary classification was related to the
stability of the candidate algorithms by Kearns and Ron (1999): they proved
oracle-type inequalities called "sanity-check bounds", describing the worst-case
performance of LOO (see also Bousquet and Elisseeff, 2002).

An experimental comparison of several CV methods and bootstrap-based
CV (in particular .632+ bootstrap) in classification can also be found in papers
by Efron (1986) and Efron and Tibshirani (1997).



7 Cross-validation for identification 

Let us now focus on model selection when the goal is to identify the "true model"
$S_{m_0}$, as described in Section 2.3. In this framework, asymptotic optimality is
replaced by (model) consistency, that is,

\[ \mathbb{P}\left( \widehat{m}(D_n) = m_0 \right) \xrightarrow[n \to \infty]{} 1 . \]

Classical model selection procedures built for identification, such as BIC, are
described in Section 3.3.

7.1 General conditions towards model consistency 

At first sight, it may seem strange to use CV for identification: LOO, which
is the pioneering CV procedure, is actually closely related to the unbiased risk
estimation principle, which is only efficient when the goal is estimation.
Furthermore, estimation and identification are somewhat contradictory goals, as
explained in Section 2.4.

This intuition about inconsistency of some CV procedures is confirmed by
several theoretical results. Shao (1993) proved that several CV methods are
inconsistent for variable selection in linear regression: LOO, LPO, and BICV
when $\liminf_{n \to \infty} (n_t / n) > 0$. Even if these CV methods
asymptotically select all the true variables with probability 1, the probability
that they select too many variables does not tend to zero. More generally,
Shao (1997) proved that CV procedures behave asymptotically like
$\mathrm{GIC}_{\lambda_n}$ with $\lambda_n = 1 + n/n_t$, which leads to
inconsistency as soon as $n/n_t = O(1)$.

In the context of ordered variable selection in linear regression, Zhang (1993)
computed the asymptotic value of the probability of selecting the true model
for several CV procedures. He also numerically compared the values of this
probability for the same CV procedures in a specific example. For LPO with
$p/n \to \lambda \in (0, 1)$ as $n$ tends to $+\infty$,
$\mathbb{P}(\widehat{m} = m_0)$ increases with $\lambda$. The result is
slightly different for VFCV: $\mathbb{P}(\widehat{m} = m_0)$ increases with $V$
(hence, it is maximal for the LOO, which is the worst case of LPO). The
variability induced by the number $V$ of splits seems to be more important here
than the bias of VFCV. Nevertheless, $\mathbb{P}(\widehat{m} = m_0)$ is almost
constant between $V = 10$ and $V = n$, so that taking $V > 10$ is not advised
for computational reasons.

These results suggest that if the training sample size $n_t$ is negligible
compared to $n$, then model consistency could be obtained. This has been
confirmed theoretically by Shao (1993, 1997) for the variable selection problem
in linear regression: CV is consistent when $n \gg n_t \to \infty$, in
particular RLT, BICV (defined in Section 4.3.2) and LPO with $p = p_n \sim n$
and $n - p_n \to \infty$.

Therefore, when the goal is to identify the true model, a larger proportion of
the data should be put in the validation set in order to improve the performance.
This phenomenon is somewhat related to the cross-validation paradox (Yang,
2006).



7.2 Refined analysis for the algorithm selection problem 

The behaviour of CV for identification is better understood by considering a
more general framework, where the goal is to select among statistical algorithms
the one with the fastest convergence rate. Yang (2006, 2007) considered this
problem for two candidate algorithms (or more generally any finite number of
algorithms). Let us mention here that Stone (1977) considered a few specific
examples of this problem, and showed that LOO can be inconsistent for choosing
the best among two "good" estimators.

The conclusion of Yang's papers is that the sufficient condition on $n_t$ for
the consistency in selection of CV strongly depends on the convergence rates
$(r_{n,i})_{i=1,2}$ of the candidate algorithms. Let us assume that $r_{n,1}$
and $r_{n,2}$ differ at least by a multiplicative constant $C > 1$. Then, in the
regression framework, if the risk of $\widehat{s}_i$ is measured by
$\mathbb{E}\|\widehat{s}_i - s\|_2$, Yang (2007) proved that the hold-out,
VFCV, RLT and LPO with voting (CV-v, see Section 4.2.2) are consistent
in selection if

\[ n_v, n_t \to \infty \quad \text{and} \quad \sqrt{n_v} \, \max_i r_{n_t,i} \to \infty , \tag{14} \]

under some conditions on $\|\widehat{s}_i - s\|_p$ for $p = 2, 4, \infty$. In
the classification framework, if the risk of $\widehat{s}_i$ is measured by
$\mathbb{P}(\widehat{s}_i \neq s)$, Yang (2006) proved the same consistency
result for CV-v under the condition

\[ n_v, n_t \to \infty \quad \text{and} \quad \frac{n_v \max_i r_{n_t,i}^2}{s_{n_t}} \to \infty , \tag{15} \]

where $s_n$ is the convergence rate of
$\mathbb{P}(\widehat{s}_1(D_n) \neq \widehat{s}_2(D_n))$.

Intuitively, consistency holds as soon as the uncertainty of each estimate of
the risk (roughly proportional to $n_v^{-1/2}$) is negligible compared to the
risk gap $|r_{n_t,1} - r_{n_t,2}|$ (which is of the same order as
$\max_i r_{n_t,i}$). This condition holds either when at least one of the
algorithms converges at a non-parametric rate, or when $n_t \ll n$, which
artificially widens the risk gap.
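
As a rough numerical illustration of this heuristic, the sketch below evaluates
the quantity $\sqrt{n_v} \max_i r_{n_t,i}$ appearing in condition (14). It
relies on a simplifying assumption that is ours and not Yang's: the slowest
rate is taken to be exactly $n_t^{-\alpha}$. The function name and the sample
sizes are hypothetical.

```python
import math


def condition_14_statistic(n, n_t, alpha):
    """Return sqrt(n_v) * n_t**(-alpha), where n_v = n - n_t.

    Assumption for illustration only: the slowest candidate converges
    exactly at rate n_t**(-alpha); alpha = 1/2 is the parametric rate.
    """
    n_v = n - n_t
    return math.sqrt(n_v) * n_t ** (-alpha)


for n in (10**2, 10**4, 10**6):
    half = n // 2            # usual CV regime, n_t of order n
    small = math.isqrt(n)    # n_t of order sqrt(n), hence n_t << n
    print(
        n,
        round(condition_14_statistic(n, half, 0.5), 3),    # parametric, n_t = n/2
        round(condition_14_statistic(n, small, 0.5), 3),   # parametric, n_t << n
        round(condition_14_statistic(n, half, 1 / 3), 3),  # slower rate, n_t = n/2
    )
```

Under these assumptions, the first printed quantity stays bounded as $n$ grows,
whereas the other two diverge, in line with the two cases described above.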

Empirical results in the same direction were obtained by Dietterich (1998)
and by Alpaydin (1999), leading to the advice that $V = 2$ is the best choice
when VFCV is used for comparing two learning procedures. See also the results
by Nadeau and Bengio (2003) about CV considered as a testing procedure
comparing two candidate algorithms.

The sufficient conditions (14) and (15) can be simplified depending on
$\max_i r_{n,i}$, so that the ability of CV to distinguish between two
algorithms depends on their convergence rates. On the one hand, if
$\max_i r_{n,i} \propto n^{-1/2}$, then (14) or (15) only hold when
$n_v \gg n_t \to \infty$ (under some conditions on $s_n$ in classification).
Therefore, the cross-validation paradox holds for comparing algorithms
converging at the parametric rate (model selection when a true model exists
being only a particular case). Note that possibly stronger conditions can
be required in classification where algorithms can converge at fast rates,
between $n^{-1}$ and $n^{-1/2}$.

On the other hand, (14) and (15) are milder conditions when
$\max_i r_{n,i} \gg n^{-1/2}$: They are implied by $n_t / n_v = O(1)$, and they
even allow $n_t \sim n$ (under some conditions on $s_n$ in classification).
Therefore, non-parametric algorithms can be compared by more usual CV
procedures ($n_t > n/2$), even if LOO is still excluded by conditions (14) and
(15).

Note that according to simulation experiments, CV with averaging (that
is, CV as usual) and CV with voting are equivalent at first but not at second
order, so that they can differ when $n$ is small (Yang, 2007).
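
To make the distinction between the two aggregation schemes concrete, here is a
minimal sketch. It assumes the hold-out risk estimates have already been
computed and stored as a list of rows, one row per split and one entry per
candidate algorithm. The function names, the input layout and the tie-breaking
rule (first index wins) are illustrative assumptions, not Yang's exact
definitions.

```python
def select_by_averaging(holdout_risks):
    """CV as usual: average the hold-out estimates over splits, then minimise.

    holdout_risks[j][i] is the hold-out risk estimate of algorithm i on split j.
    """
    n_algos = len(holdout_risks[0])
    means = [
        sum(row[i] for row in holdout_risks) / len(holdout_risks)
        for i in range(n_algos)
    ]
    return min(range(n_algos), key=means.__getitem__)


def select_by_voting(holdout_risks):
    """CV with voting: each split votes for its best algorithm."""
    n_algos = len(holdout_risks[0])
    votes = [0] * n_algos
    for row in holdout_risks:
        winner = min(range(n_algos), key=row.__getitem__)
        votes[winner] += 1
    return max(range(n_algos), key=votes.__getitem__)


# Hypothetical estimates: algorithm 0 wins two splits narrowly
# but is much worse on the third one.
risks = [[0.10, 0.11], [0.10, 0.11], [0.50, 0.11]]
print(select_by_averaging(risks), select_by_voting(risks))
```

In this toy example the two rules disagree, which is only possible because a
single split dominates the average; with many splits and large samples such
discrepancies are expected to vanish at first order.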



8 Specificities of some frameworks 

Originally, the CV principle was proposed for i.i.d. observations and usual
contrasts such as least-squares and log-likelihood. Therefore, CV procedures
may have to be modified in other specific frameworks, such as estimation in
the presence of outliers or with dependent data.



8.1 Density estimation 

In the density estimation framework, some specific modifications of CV have
been proposed.

First, Hall et al. (1992) defined the "smoothed CV", which consists in
pre-smoothing the data before using CV, an idea related to the smoothed
bootstrap. They proved that smoothed CV yields an excellent asymptotic model
selection performance under various smoothness conditions on the density.

Second, when the goal is to estimate the density at one point (and not
globally), Hall and Schucany (1989) proposed a local version of CV and proved
its asymptotic optimality.



8.2 Robustness to outliers 



In the presence of outliers in regression, Leung (2005) studied how CV must be
modified to get both asymptotic efficiency and a consistent bandwidth estimator
(see also Leung et al., 1993).

Two changes are possible to achieve robustness: Choosing a "robust" regressor,
or choosing a robust loss function. In the presence of outliers, classical CV
with a non-robust loss function has been shown to fail by Härdle (1984).

Leung (2005) described a CV procedure based on robust losses like $L^1$ and
Huber's (Huber, 1964) ones. The same strategy remains applicable to other
setups like linear models in Ronchetti et al. (1997).



8.3 Time series and dependent observations 

As explained in Section 4.1, CV is built upon the heuristics that part of the
sample (the validation set) can play the role of new data with respect to the
rest of the sample (the training set). "New" means that the validation set is
independent from the training set with the same distribution.

Therefore, when data $\xi_1, \ldots, \xi_n$ are not independent, CV must be
modified, like other model selection procedures (in non-parametric regression
with dependent data, see the review by Opsomer et al., 2001).



Let us first consider the statistical framework of Section 1 with
$\xi_1, \ldots, \xi_n$ identically distributed but not independent. Then, when
for instance data are positively correlated, Hart and Wehrly (1986) proved that
CV overfits for choosing the bandwidth of a kernel estimator in regression (see
also Chu and Marron, 1991; Opsomer et al., 2001).



The main approach used in the literature for solving this issue is to choose
$I^{(t)}$ and $I^{(v)}$ such that
$\min_{i \in I^{(t)}, \, j \in I^{(v)}} |i - j| > h > 0$, where $h$ controls
the distance from which observations $\xi_i$ and $\xi_j$ are independent. For
instance, the LOO can be changed into: $I^{(v)} = \{J\}$ where $J$ is uniformly
chosen in $\{1, \ldots, n\}$, and
$I^{(t)} = \{1, \ldots, J - h - 1, J + h + 1, \ldots, n\}$, a method called
"modified CV" by Chu and Marron (1991) in the context of bandwidth selection.
Then, for short range dependences, $\xi_i$ is almost independent from $\xi_j$
when $|i - j| > h$ is large enough, so that $(\xi_j)_{j \in I^{(t)}}$ is almost
independent from $(\xi_j)_{j \in I^{(v)}}$. Several asymptotic optimality
results have been proved on modified CV, for instance by Hart and Vieu (1990)
for bandwidth choice in kernel density estimation, when data are
$\alpha$-mixing (hence, with a short range dependence structure) and
$h = h_n \to \infty$ "not too fast". Note that modified CV also enjoys some
asymptotic optimality results with long-range dependences, as proved by Hall et
al. (1995), even if an alternative block bootstrap method seems more
appropriate in such a framework.
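
The splitting rule of modified CV can be sketched as follows. The function name
is hypothetical, indices are 1-based as in the text, and the validation index
$J$ is passed as an argument (it could be drawn with `random.randint(1, n)`).

```python
def modified_loo_split(n, j, h):
    """Training and validation indices (1-based) of modified CV.

    The validation set is {j}; every index within distance h of j is
    removed from the training set, so that min |i - j| > h.
    """
    if not 1 <= j <= n:
        raise ValueError("j must lie in {1, ..., n}")
    validation = [j]
    training = [i for i in range(1, n + 1) if abs(i - j) > h]
    return training, validation


training, validation = modified_loo_split(n=10, j=5, h=2)
print(training, validation)
```

With $h = 0$ the usual LOO split is recovered, since only the validation point
itself is removed from the training set.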

Several alternatives to modified CV have also been proposed. The "$h$-block
CV" (Burman et al., 1994) is modified CV plus a corrective term, similarly to
the bias-corrected CV by Burman (1989) (see Section 5.1). Simulation
experiments in several (short range) dependent frameworks show that this
corrective term matters when $h/n$ is not small, in particular when $n$ is
small.



The "partitioned CV" has been proposed by Chu and Marron (1991) for
bandwidth selection: An integer $g > 0$ is chosen, a bandwidth
$\widehat{\lambda}_k$ is chosen by CV based upon the subsample
$(\xi_{k+gj})_{j \geq 0}$ for each $k = 1, \ldots, g$, and the selected
bandwidth is a combination of $(\widehat{\lambda}_k)$.
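
The subsampling step of partitioned CV can be sketched as below. Two elements
are assumptions made for illustration: `select_bandwidth` stands for any
user-supplied CV bandwidth selector (a hypothetical callable), and the
combination rule is a plain average, whereas the text only says that "a
combination" is used.

```python
from statistics import mean


def partitioned_indices(n, g):
    """1-based index sets {k, k + g, k + 2g, ...} for k = 1, ..., g."""
    return [list(range(k, n + 1, g)) for k in range(1, g + 1)]


def partitioned_cv_bandwidth(data, g, select_bandwidth, combine=mean):
    """Select one bandwidth per subsample, then combine them.

    select_bandwidth: hypothetical callable mapping a subsample to a bandwidth.
    combine: assumed combination rule (plain mean by default).
    """
    bandwidths = []
    for indices in partitioned_indices(len(data), g):
        subsample = [data[i - 1] for i in indices]
        bandwidths.append(select_bandwidth(subsample))
    return combine(bandwidths)


print(partitioned_indices(10, 3))
```

Since consecutive points of each subsample are $g$ steps apart in the original
sequence, short range dependences are weakened within each subsample.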



When a parametric model is available for the dependency structure, Hart
(1994) proposed the "time series CV".



An important framework where data often are dependent is time-series
analysis, in particular when the goal is to predict the next observation
$\xi_{n+1}$ from the past $\xi_1, \ldots, \xi_n$. When data are stationary,
$h$-block CV and similar approaches can be used to deal with (short range)
dependences. Nevertheless, Burman and Nolan (1992) proved in some specific
framework that unaltered CV is asymptotically optimal when
$\xi_1, \ldots, \xi_n$ is a stationary Markov process.

On the contrary, using CV for non-stationary time-series is a quite difficult
problem. The only reasonable approach in general is the hold-out, that is,
$I^{(t)} = \{1, \ldots, m\}$ and $I^{(v)} = \{m + 1, \ldots, n\}$ for some
deterministic $m$. Each model is first trained with $(\xi_j)_{j \in I^{(t)}}$.
Then, it is used for predicting successively $\xi_{m+1}$ from
$(\xi_j)_{j \leq m}$, $\xi_{m+2}$ from $(\xi_j)_{j \leq m+1}$, and so on. The
model with the smallest average error for predicting $(\xi_j)_{j \in I^{(v)}}$
from the past is chosen.
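
This hold-out scheme for time series can be sketched as follows. The interface
is an assumption made for illustration: each candidate is a `fit` function that
is trained once on the first $m$ observations and returns a predictor taking
the whole past as input; the squared error and the two toy candidates are also
hypothetical choices.

```python
def time_series_holdout(series, m, fitters):
    """Select the candidate with the smallest average one-step-ahead error.

    fitters: dict mapping a name to fit(train) -> predict(past) -> float.
    """
    if not 0 < m < len(series):
        raise ValueError("m must satisfy 0 < m < len(series)")
    errors = {}
    for name, fit in fitters.items():
        predict = fit(series[:m])  # trained once, on xi_1, ..., xi_m
        squared = [
            (predict(series[:t]) - series[t]) ** 2  # predict xi_{t+1} from the past
            for t in range(m, len(series))
        ]
        errors[name] = sum(squared) / len(squared)
    best = min(errors, key=errors.get)
    return best, errors


def fit_mean(train):
    """Toy candidate: always predict the training mean."""
    mu = sum(train) / len(train)
    return lambda past: mu


def fit_last(train):
    """Toy candidate: predict the last observed value."""
    return lambda past: past[-1]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(time_series_holdout(series, 4, {"mean": fit_mean, "last": fit_last}))
```

On this trending toy series, the candidate using the most recent value is
selected, as one would expect for non-stationary data.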

8.4 Large number of models 

As mentioned in Section 3, model selection procedures estimating unbiasedly
the risk of each model fail when, in particular, the number of models grows
exponentially with $n$ (Birgé and Massart, 2007). Therefore, CV cannot be used
directly, except maybe with $n_t \ll n$, provided $n_t$ is well chosen (see
Section 6 and Celisse, 2008b, Chapter 6).



For least-squares regression with homoscedastic data, Wegkamp (2003)
proposed to add to the hold-out estimator of the risk a penalty term depending
on the number of models. This method is proved to satisfy a non-asymptotic
oracle inequality with leading constant $C > 1$.

Another general approach was proposed by Arlot and Celisse (2009) in the
context of multiple change-point detection. The idea is to perform model
selection in two steps: First, gather the models $(S_m)_{m \in \mathcal{M}_n}$
into meta-models $(\widetilde{S}_D)_{D \in \mathcal{D}_n}$, where
$\mathcal{D}_n$ denotes a set of indices such that
$\mathrm{Card}(\mathcal{D}_n)$ grows at most polynomially with $n$. Inside each
meta-model $\widetilde{S}_D = \bigcup_{m \in \mathcal{M}_n(D)} S_m$,
$\widehat{s}_D$ is chosen from data by optimizing a given criterion, for
instance the empirical contrast $\mathcal{L}_{P_n}(t)$, but other criteria can
be used. Second, CV is used for choosing among
$(\widehat{s}_D)_{D \in \mathcal{D}_n}$. Simulation experiments show this
simple trick automatically takes into account the cardinality of
$\mathcal{M}_n$, even when data are heteroscedastic, contrary to other model
selection procedures built for exponential collections of models, which all
assume homoscedasticity of data.

9 Closed-form formulas and fast computation



Resampling strategies, like CV, are known to be time consuming. The naive
implementation of CV has a computational complexity of $B$ times the complexity
of training each algorithm $\mathcal{A}$, which is usually intractable for LPO,
even with $p = 1$. VFCV or RLT can still be quite costly when $B > 10$ in many
practical problems. Nevertheless, closed-form formulas for CV estimators of the
risk can be obtained in several frameworks, which greatly decreases the
computational cost of CV.

In density estimation, closed-form formulas have been originally derived by
Rudemo (1982) and by Bowman (1984) for the LOO risk estimator of histograms
and kernel estimators. These results have been recently extended by
Celisse and Robin (2008) to the LPO risk estimator with the quadratic loss.
Similar results are more generally available for projection estimators as
settled by Celisse (2008a). Intuitively, such formulas can be obtained provided
the number $N$ of values taken by the $B = \binom{n}{p}$ hold-out estimators of
the risk, corresponding to different data splittings, is at most polynomial in
the sample size.
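
For regular histograms, the kind of closed form meant here can be checked
numerically. The sketch below assumes the classical least-squares LOO criterion
$\int \widehat{f}^2 - \frac{2}{n} \sum_i \widehat{f}^{(-i)}(X_i)$ with bin
width $h$ and bin counts $N_j$, for which a direct computation gives
$\frac{2}{(n-1)h} - \frac{n+1}{n^2 (n-1) h} \sum_j N_j^2$. The function names
and the `origin` parameter are ours, and the normalization may differ from the
one used in the cited papers.

```python
import math


def bin_counts(sample, h, origin=0.0):
    """Counts of a regular histogram with bin width h."""
    counts = {}
    for x in sample:
        b = math.floor((x - origin) / h)
        counts[b] = counts.get(b, 0) + 1
    return counts


def loo_histogram_naive(sample, h, origin=0.0):
    """Recompute the histogram n times, once per left-out point."""
    n = len(sample)
    counts = bin_counts(sample, h, origin)
    integral_sq = sum((c / (n * h)) ** 2 * h for c in counts.values())
    loo_sum = 0.0
    for i, x in enumerate(sample):
        rest = sample[:i] + sample[i + 1:]
        b = math.floor((x - origin) / h)
        loo_sum += bin_counts(rest, h, origin).get(b, 0) / ((n - 1) * h)
    return integral_sq - 2.0 * loo_sum / n


def loo_histogram_closed_form(sample, h, origin=0.0):
    """Same criterion, computed from the bin counts only."""
    n = len(sample)
    sum_sq = sum(c * c for c in bin_counts(sample, h, origin).values())
    return 2.0 / ((n - 1) * h) - (n + 1) * sum_sq / (n ** 2 * (n - 1) * h)


sample = [0.05, 0.12, 0.13, 0.40, 0.41, 0.42, 0.77, 0.90]
print(loo_histogram_naive(sample, 0.25), loo_histogram_closed_form(sample, 0.25))
```

The naive version costs $n$ histogram fits, whereas the closed form needs a
single pass over the data, which is the computational gain discussed above.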



For least-squares estimators in linear regression, Zhang (1993) proved a
closed-form formula for the LOO estimator of the risk. Similar results have
been obtained by Wahba (1975, 1977), and by Craven and Wahba (1979) in the
spline smoothing context as well. These papers led in particular to the
definition of GCV (see Section 4.3.3) and related procedures, which are often
used instead of CV (with a naive implementation) because of their small
computational cost, as emphasized by Girard (1998).
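
A standard instance of such a closed form is the leverage identity for ordinary
least squares: the LOO residual at point $i$ equals the full-sample residual
divided by $1 - h_{ii}$, where $h_{ii}$ is the $i$-th diagonal entry of the hat
matrix. The sketch below checks it for simple linear regression with an
intercept; we assume this identity is representative of the formulas cited
above, without claiming it is the exact statement of any of these papers. All
function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - slope * x_bar, slope


def loo_risk_refit(xs, ys):
    """Naive LOO: refit the line n times."""
    total = 0.0
    for i in range(len(xs)):
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / len(xs)


def loo_risk_leverage(xs, ys):
    """Closed form: a single fit, residuals rescaled by 1 - leverage."""
    n = len(xs)
    a, b = fit_line(xs, ys)
    x_bar = sum(xs) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    total = 0.0
    for x, y in zip(xs, ys):
        leverage = 1.0 / n + (x - x_bar) ** 2 / sxx
        total += ((y - (a + b * x)) / (1.0 - leverage)) ** 2
    return total / n


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.1, 0.9, 2.3, 2.8, 4.2]
print(loo_risk_refit(xs, ys), loo_risk_leverage(xs, ys))
```

Both functions return the same value up to rounding errors, but the second one
requires a single least-squares fit instead of $n$.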

Closed-form formulas for the LPO estimator of the risk were also obtained by
Celisse (2008b) in regression for kernel and projection estimators, in
particular for regressograms. An important property of these closed-form
formulas is their additivity: For a regressogram associated to a partition
$(I_\lambda)_{\lambda \in \Lambda_m}$ of $\mathcal{X}$, the LPO estimator of
the risk can be written as a sum over $\lambda \in \Lambda_m$ of terms which
only depend on observations $(X_j, Y_j)$ such that $X_j \in I_\lambda$.
Therefore, dynamic programming (Bellman and Dreyfus, 1962) can be used for
minimizing the LPO estimator of the risk over the set of partitions of
$\mathcal{X}$ in $D$ pieces. As an illustration, Arlot and Celisse (2009)
successfully applied this strategy in the change-point detection framework.
Note that the same idea can be used with VFCV or RLT, but for a larger
computational cost since no closed-form formulas are available for these CV
methods.
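
The role of additivity can be sketched with a generic dynamic program that
minimizes any criterion written as a sum of per-segment costs over partitions
of $\{0, \ldots, n-1\}$ into $D$ consecutive segments. The per-segment cost
used in the example is the within-segment sum of squares, a stand-in chosen for
illustration: the closed-form LPO terms of Celisse (2008b) are not reproduced
here, and the function names are hypothetical.

```python
def best_partition(n, D, cost):
    """Minimise sum of cost(a, b) over D consecutive segments [a, b).

    Returns (minimal value, sorted list of the D - 1 interior breakpoints).
    """
    if not 1 <= D <= n:
        raise ValueError("D must satisfy 1 <= D <= n")
    inf = float("inf")
    # best[d][b]: minimal cost of cutting the first b points into d segments
    best = [[inf] * (n + 1) for _ in range(D + 1)]
    arg = [[0] * (n + 1) for _ in range(D + 1)]
    best[0][0] = 0.0
    for d in range(1, D + 1):
        for b in range(d, n + 1):
            for a in range(d - 1, b):
                value = best[d - 1][a] + cost(a, b)
                if value < best[d][b]:
                    best[d][b], arg[d][b] = value, a
    # Backtrack the left endpoint of each segment, from last to first.
    starts, b = [], n
    for d in range(D, 0, -1):
        b = arg[d][b]
        starts.append(b)
    return best[D][n], sorted(starts)[1:]  # drop the first start, always 0


def squared_error_cost(ys):
    """Illustrative additive cost: within-segment sum of squares."""
    def cost(a, b):
        segment = ys[a:b]
        mean = sum(segment) / len(segment)
        return sum((y - mean) ** 2 for y in segment)
    return cost


ys = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]
print(best_partition(len(ys), 2, squared_error_cost(ys)))
```

The number of cost evaluations is of order $D n^2$, whereas the number of
partitions into $D$ segments grows like $n^{D-1}$.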

Finally, in frameworks where no closed-form formula can be proved, some
efficient algorithms exist for avoiding to recompute
$\widehat{\mathcal{L}}^{\mathrm{HO}}(\mathcal{A}; D_n; I_j^{(t)})$ from
scratch for each data splitting $I_j^{(t)}$. These algorithms rely on updating
formulas such as the ones by Ripley (1996) for LOO in linear and quadratic
discriminant analysis; this approach makes LOO as expensive to compute as the
empirical risk.

Very similar formulas are also available for LOO and the $k$-nearest
neighbours estimator in classification (Daudin and Mary-Huard, 2008).

10 Conclusion: which cross-validation method for which problem?

This conclusion collects a few guidelines aiming at helping CV users, first in
interpreting the results of CV, second in appropriately using CV in each
specific problem.

10.1 The general picture 

Drawing a general conclusion on CV methods is an impossible task because of 
the variety of frameworks where CV can be used, which induces a variety of 
behaviors of CV. Nevertheless, we can still point out the three main criteria to 
take into account for choosing a CV method for a particular model selection 
problem: 

• Bias: CV roughly estimates the risk of a model with a sample size $n_t < n$
(see Section 5.1). Usually, this implies that CV overestimates the variance
term compared to the bias term in the bias-variance decomposition (2)
with sample size $n$.

When the goal is estimation and the signal-to-noise ratio (SNR) is large,
a smaller bias is usually better, which is obtained by taking $n_t \sim n$.
Otherwise, CV can be asymptotically suboptimal. Nevertheless, when the
goal is estimation and the SNR is small, keeping a small upward bias for
the variance term often improves the performance, which is obtained by
taking $n_t \sim \kappa n$ with $\kappa \in (0, 1)$. See Section 6.

When the goal is identification, a large bias is often needed, which is
obtained by taking $n_t \ll n$; depending on the framework, larger values of
$n_t$ can also lead to model consistency, see Section 7.

• Variability: The variance of the CV estimator of the risk is usually a
decreasing function of the number $B$ of splits, for a fixed training size.
When the number of splits is fixed, the variability of CV also depends
on the training sample size $n_t$. Usually, CV is more variable when $n_t$ is
closer to $n$. However, when $B$ is linked with $n_t$ (as for VFCV or LPO),
the variability of CV must be quantified precisely, which has been done in
few frameworks. The only general conclusion on this point is that the CV
method with minimal variability seems strongly framework-dependent, see
Section 5.2 for details.

• Computational complexity: Unless closed-form formulas or analytic
approximations are available (see Section 9), the complexity of CV is roughly
proportional to the number of data splits: 1 for the hold-out, $V$ for VFCV,
$B$ for RLT or MCCV, $n$ for LOO, and $\binom{n}{p}$ for LPO.
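
The orders of magnitude in the last item can be made explicit with a small
helper. The function name and its string labels are hypothetical; the counts
are those listed above.

```python
import math


def number_of_splits(method, n, V=None, B=None, p=None):
    """Number of data splits used by each CV method listed above."""
    if method == "hold-out":
        return 1
    if method == "VFCV":
        return V
    if method in ("RLT", "MCCV"):
        return B
    if method == "LOO":
        return n
    if method == "LPO":
        return math.comb(n, p)
    raise ValueError(f"unknown method: {method}")


n = 100
for method, kwargs in [
    ("hold-out", {}),
    ("VFCV", {"V": 10}),
    ("RLT", {"B": 50}),
    ("LOO", {}),
    ("LPO", {"p": 2}),
    ("LPO", {"p": 10}),
]:
    print(method, kwargs, number_of_splits(method, n, **kwargs))
```

Already for $n = 100$, leaving out $p = 10$ points yields more than $10^{13}$
splits, which is why a naive implementation of LPO is out of reach without
closed-form formulas.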

The optimal trade-off between these three factors can be different for each
problem, depending for instance on the computational complexity of each
estimator, on specificities of the framework considered, and on the final
user's trade-off between statistical performance and computational cost.
Therefore, no "optimal CV method" can be pointed out before having taken into
account the final user's preferences.

Nevertheless, in density estimation, closed-form expressions of the LPO
estimator have been derived by Celisse and Robin (2008) with histograms and
kernel estimators, and by Celisse (2008a) for projection estimators. These
expressions make it possible to perform LPO without additional computational
cost, which reduces the aforementioned trade-off to the easier bias-variability
trade-off. In particular, Celisse and Robin (2008) proposed to choose $p$ for
LPO by minimizing a criterion defined as the sum of a squared bias term and a
variance term (see also Politis et al., 1999, Chapter 9).



10.2 How should the splits be chosen?

For hold-out, VFCV, and RLT, an important question is to choose a particular 
sequence of data splits. 

First, should this step be random and independent from $D_n$, or take into
account some features of the problem or of the data? It is often recommended
to take into account the structure of data when choosing the splits. If data
are stratified, the proportions of the different strata should (approximately)
be the same in the sample and in each training and validation sample.
Besides, the training samples should be chosen so that
$\widehat{s}_m(D_n^{(t)})$ is well defined for every training set; in the
regressogram case, this led Arlot (2008c) and Arlot and Celisse (2009) to
choose carefully the splitting scheme. In supervised classification,
practitioners usually choose the splits so that the proportion of each class is
the same in every validation sample as in the sample. Nevertheless, Breiman and
Spector (1992) made simulation experiments in regression for comparing several
splitting strategies. No significant improvement was reported from taking into
account the stratification of data for choosing the splits.
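
One simple way of building such stratified splits for VFCV is sketched below:
indices are shuffled within each class and then dealt to the $V$ folds in turn.
This is only one possible scheme, written under our own assumptions (function
name, seed handling, round-robin assignment); it is not taken from the papers
cited above.

```python
import random
from collections import defaultdict


def stratified_folds(labels, V, seed=0):
    """Partition range(len(labels)) into V folds with balanced classes.

    Each fold is meant to serve once as validation sample, the other
    V - 1 folds forming the training sample.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(V)]
    offset = 0  # keeps fold sizes balanced across classes
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, i in enumerate(indices):
            folds[(offset + position) % V].append(i)
        offset += len(indices)
    return folds


labels = [0] * 6 + [1] * 4
for fold in stratified_folds(labels, V=2):
    print(sorted(fold), [labels[i] for i in sorted(fold)])
```

Class proportions in each fold match those of the full sample up to rounding,
which is the property usually sought in supervised classification.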

Another question related to the choice of $(I_j^{(t)})_{1 \leq j \leq B}$ is
whether the $I_j^{(t)}$ should be independent (like MCCV), slightly dependent
(like RLT), or strongly dependent (like VFCV). It seems intuitive that giving
similar roles to all data points in the $B$ "training and validation tasks"
should yield more reliable results than other methods. This intuition may
explain why VFCV is much more used than RLT or MCCV. Similarly, Shao (1993)
proposed a CV method called BICV, where every point and pair of points appear
in the same number of splits, see Section 4.3.2. Nevertheless, most recent
theoretical results on the various CV procedures are not accurate enough to
distinguish which one may be the best splitting strategy: This remains a widely
open theoretical question.

Note finally that the additional variability due to the choice of a sequence of
data splits was quantified empirically by Jonathan et al. (2000) and
theoretically by Celisse and Robin (2008) for VFCV.



10.3 V-fold cross-validation 

VFCV is certainly the most popular CV procedure, in particular because of 
its mild computational cost. Nevertheless, the question of choosing V remains 
widely open, even if indications can be given towards an appropriate choice. 

A specific feature of VFCV, as well as exhaustive strategies, is that
choosing $V$ uniquely determines the size of the training set
$n_t = n(V - 1)/V$ and the number of splits $B = V$, hence the computational
cost. Contradictory phenomena then occur.

On the one hand, the bias of VFCV decreases with $V$ since $n_t = n(1 - 1/V)$
observations are used in the training set. On the other hand, the variance of
VFCV decreases with $V$ for small values of $V$, whereas the LOO ($V = n$) is
known to suffer from a high variance in several frameworks such as
classification or density estimation. Note however that the variance of VFCV is
minimal for $V = n$ in some frameworks like linear regression (see Section
5.2). Furthermore, estimating the variance of VFCV from data is a difficult
problem in general, see Section 5.2.3.

When the goal of model selection is estimation, it is often reported in the
literature that the optimal $V$ is between 5 and 10, because the statistical
performance does not increase much for larger values of $V$, and averaging over
5 or 10 splits remains computationally feasible (Hastie et al., 2001, Section
7.10). Even if this claim is clearly true for many problems, the conclusion of
this survey is that better statistical performance can sometimes be obtained
with other values of $V$, for instance depending on the SNR value.

When the SNR is large, the asymptotic comparison of CV procedures
recalled in Section 6.2 can be trusted: LOO performs (nearly) unbiased risk
estimation hence is asymptotically optimal, whereas VFCV with $V = O(1)$ is
suboptimal. On the contrary, when the SNR is small, overpenalization can
improve the performance. Therefore, VFCV with $V < n$ can yield a smaller
risk than LOO thanks to its bias and despite its variance when $V$ is small
(see simulation experiments by Arlot, 2008c). Furthermore, other CV procedures
like RLT can be interesting alternatives to VFCV, since they allow choosing
the bias (through $n_t$) independently from $B$, which mainly governs the
variance. Another possible alternative is $V$-fold penalization, which is
related to corrected VFCV (see Section 4.3.3).

When the goal of model selection is identification, the main drawback of
VFCV is that $n_t \ll n$ is often required for choosing consistently the true
model (see Section 7), whereas VFCV does not allow $n_t < n/2$. Depending on
the framework, different (empirical) recommendations for choosing $V$ can be
found in the literature. In ordered variable selection, the largest $V$ seems
to be the best, $V = 10$ providing results close to the optimal ones (Zhang,
1993). On the contrary, Dietterich (1998) and Alpaydin (1999) recommend $V = 2$
for choosing the best learning procedure among two candidates.



10.4 Future research 

Perhaps the most important direction for future research would be to provide,
in each specific framework, precise quantitative measures of the variance of CV
estimators of the risk, depending on $n_t$, the number of splits, and how the
splits are chosen. Up to now, only a few precise results have been obtained
in this direction, for some specific CV methods in linear regression or density
estimation (see Section 5.2). Proving similar results in other frameworks and
for more general CV methods would greatly help to choose a CV method for
any given model selection problem.

More generally, most theoretical results are not precise enough to make any
distinction between the hold-out and CV methods having the same training
sample size $n_t$, because they are equivalent at first order. Second order
terms do matter for realistic values of $n$, which shows the dramatic need for
theory that takes into account the variance of CV when comparing CV methods
such as VFCV and RLT with $n_t = n(V - 1)/V$ but $B \neq V$.